ABSTRACT

A storage control enables data on various storage media to
be shared among host computers having various different
host computer input/output interfaces. A control processor
checks a host computer interface management table when
a write is requested by a host computer (HCP). The control
processor writes the write data in a cache slot of the cache
memory without converting the format if the data format of
the HCP is an FBA format, and it converts the format into
the FBA format and writes the data when the data format of
the HCP is a CKD format. The processor checks the table
when a read is requested by the HCP. If the data format of
the HCP is FBA, the processor transfers the read data read
from the cache slot without converting it, and if it is CKD,
the processor converts the format into the CKD format and
transfers the format-converted data. The control processors
retrieve write data in the cache memory and write it in a
drive.

19 Claims, 9 Drawing Sheets 
STORAGE CONTROL AND COMPUTER
SYSTEM USING THE SAME 

BACKGROUND OF THE INVENTION 

The present invention relates to a storage control and in 
particular to a storage control system which enables data on 
various storage media for storing input/output data to/from 
host computers to be shared among said host computers 
having various host computer input/output interfaces; and a 
computer system using the same. 

Recently, there have been an increasing number of cases in
which main frames are linked with open-system-based
divisional systems, for example where part of the operations
(transactions, jobs) formerly processed by a main frame is
downsized to a divisional server (for example, a UNIX
server) or where an information system is introduced into a
division.

In these cases, due to the fact that the data format (CKD
format) of the main frame is different from the data format
(FBA format) of the host computer input/output interface of
the UNIX server, development of programs for data
conversion and data conversion between host computers is
required, or a storage control devoted to each host computer
input/output interface is necessary. This makes it difficult to
build a wide range of computer system configurations.

As one of the methods which have been devised in order to
overcome the above-mentioned problems, an integrated
computer system which enables various programs to be
executed without limiting the CPU (central processing unit)
architecture by adopting, for example, a hardware
configuration including a plurality of computers in one
system is disclosed in JP-A-60-254270.

A magnetic disk device including an interface which is
compatible with a plurality of different interface standards
and an interface conversion control circuit which enables
files on a magnetic disk to be shared is disclosed in
JP-A-1-309117.

SUMMARY OF THE INVENTION 
In the prior art technology which is disclosed in the
above-mentioned JP-A-60-254270, CPUs having different
architectures have a master-slave relationship with each
other. The CPU on the slave side is selected by a hardware
switch and occupies a system bus while executing an
input/output operation to/from the storage medium, so that
the CPUs having different architectures cannot use the
storage medium simultaneously. Accordingly, when the
selected CPU is used for an extended period of time, the
storage medium and/or the system bus, which are resources
common to the system, are disadvantageously occupied for
an extended period of time.

Since files in the magnetic disk device are divided and
stored for each of the CPUs having different architectures,
it is impossible to share the same file in the magnetic disk
device among CPUs having different architectures.

Although it is possible for host computers having different
architectures to share a storage medium in the prior art as
mentioned above, the storage medium is used exclusively by
one of the host computers at a time. This may remarkably
lower the utilization efficiency of the file subsystem. The
disadvantage that data cannot be shared among host
computers having different host computer input/output
interfaces is not overcome.

Although file sharing among host devices having different 
interfaces is possible in the above-mentioned JP-A-1- 
309117, data sharing among different interfaces is not men- 
tioned. 
It is an object of the present invention to provide a storage
control in which a request of data access to a storage
medium from host computers having different host computer
input/output interfaces is made possible by conducting data
conversion if it is necessary for such a request, and in which
sharing of data on the storage medium among host
computers having different host computer input/output
interfaces is made possible, whereby extensibility of the file
subsystem and responsiveness of data are enhanced to enable
a wide range of computer system configurations to be built.

It is another object of the present invention to provide a 
storage control which enables various host computer input/ 
output interfaces and/or various storage medium input/ 
output interfaces to be added to or removed from the storage 
control. 

It is a further object of the present invention to provide a 
computer system including the above-mentioned storage 
control. 

In order to accomplish the above-mentioned objects, in an
aspect of the present invention there is provided a storage
control in a computer system including a plurality of host
computers having various different host computer
input/output interfaces, the storage control for controlling
input/output to/from the host computers, and various storage
media for storing input/output data of the host computers,
wherein the storage control is made up of: control
processors, each being connected to one of the host
computers; device input/output interfaces, each for one of
the storage media, for connecting the control processors
with the various storage media to input/output data in
a predetermined format to/from each storage medium;
and a host computer interface management table for
managing the data format of the host computer interface of
each host computer;
each of the control processors includes a data format
converting unit which compares the data format of the
corresponding host computer in the host computer
interface management table with the predetermined data
format when a write data request is issued from the
corresponding host computer, converts the data format
of the write data into the predetermined data format and
writes the converted write data in the storage medium
when the compared formats do not match each other, and
writes the write data without converting the data format
of the write data when they match each other, and which
also compares the data format of the corresponding host
computer in the host computer interface management
table with the predetermined data format when a read
data request is issued from the corresponding host
computer, converts the data format of the read data read
from the storage medium into the data format of the
corresponding host computer to transfer the converted
read data to the corresponding host computer when the
compared formats do not match each other, and
transfers the read data to the corresponding host
computer without converting the data format of the read
data when they match each other. This configuration
enables data on various storage media to be
shared among host computers having different host
computer input/output interfaces.
In a computer system having the storage control of the
above-mentioned configuration, addition or removal of one
or more host computers having desired kinds of host
computer input/output interfaces and one or more control
processors compatible with the host computers is made
possible by updating the host computer interface
management table.
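
The table-driven behaviour summarized above can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name HostInterfaceTable, the helper functions, the tuple used to tag data with a format, and the choice of "FBA" as the predetermined format are all assumptions made for illustration. The conversion itself is a placeholder that only relabels the format tag.

```python
from dataclasses import dataclass, field

STORAGE_FORMAT = "FBA"  # assumed predetermined format of the storage medium


@dataclass
class HostInterfaceTable:
    """Hypothetical table: control processor number -> host data format."""
    formats: dict = field(default_factory=dict)

    def add_host(self, processor_no, data_format):
        # Adding a host and its control processor is a table update.
        self.formats[processor_no] = data_format

    def remove_host(self, processor_no):
        self.formats.pop(processor_no, None)

    def needs_conversion(self, processor_no):
        # Conversion is needed only when the host format differs
        # from the predetermined storage format.
        return self.formats[processor_no] != STORAGE_FORMAT


def relabel(record, target_format):
    # Placeholder for a real format conversion: only the tag changes.
    return (target_format, record[1])


def write(table, processor_no, payload, medium):
    record = (table.formats[processor_no], payload)
    if table.needs_conversion(processor_no):
        record = relabel(record, STORAGE_FORMAT)
    medium.append(record)


def read(table, processor_no, index, medium):
    record = medium[index]
    if table.needs_conversion(processor_no):
        record = relabel(record, table.formats[processor_no])
    return record
```

In this sketch a host whose format already equals the predetermined format is served without any conversion on either path, and adding or removing a host only touches the table.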
In accordance with one feature of the present invention,
drive control blocks (DCB), each for one of the storage
devices (drives), are provided for controlling the access to
each storage device. This enables various storage devices
having a given device input/output interface to be added or
removed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic block diagram showing one
embodiment of a computer system of the present invention;

FIG. 2 is a schematic block diagram which is useful for
explaining the summary of the present invention;

FIG. 3 is a schematic block diagram showing the
configuration of another embodiment of a disk subsystem of
the present invention;

FIG. 4 is a diagram showing the configuration of a drive
control block;

FIG. 5 is a diagram showing one example of host computer
interface management information;

FIG. 6 is a flow chart showing a data access operation;

FIG. 7 is a flow chart showing a DCB reserving operation;

FIG. 8 is a flow chart showing a DCB releasing operation;

FIG. 9 is a chart useful for illustrating the conversion
between CKD and FBA formats; and

FIG. 10 is a schematic block diagram showing the
configuration of a further embodiment of a disk subsystem of
the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Now, the embodiments of the present invention will be
described in detail with reference to the drawings.

FIG. 1 is a block diagram which is useful for explaining
the principle of the operation of the present invention and
showing the configuration of an embodiment of a computer
system of the present invention which comprises a plurality
of host computers having various different input/output
interfaces, a storage control for controlling the input/output
to/from the host computers and various storage media for
storing therein input/output data to/from the host computers.

In FIG. 1, the plurality of host computers A100, B101 and
C102 having various different host computer input/output
interfaces are connected to a magnetic disk device 111,
magnetic tape device 112 and floppy disk device 113 via a
storage control 103.

Control processors 104, 105, 106, 108, 109 and 110 which
are incorporated in the storage control 103 are adapted to
control the transfer of data among the host computers A100,
B101 and C102 and the magnetic disk device 111, magnetic
tape device 112 and the floppy disk device 113.

In other words, the control processors 104, 105 and 106
execute an input/output data transfer request from the host
computers A100, B101 and C102 and the control processors
108, 109 and 110 execute a request of input/output data
transfer to the magnetic disk device 111, magnetic tape
device 112 and the floppy disk device 113.

All the control processors 104, 105, 106, 108, 109 and 110
exchange data and control signals with each other
via a control line 107.

FIG. 2 is a block diagram for explaining the outline of the
present invention. Now, the outline of the present invention
will be described with reference to FIG. 2.

In FIG. 2, the host computers A200, B201 having different
host computer input/output interfaces and the storage
medium 203 are connected to the storage control 210, which
includes a data access unit 202. Access requests 204
(write), 205 (read), 206 (write) and 207 (read), which are
generated by the host computers A200 and B201 for the
storage medium 203, are issued to the storage control 210.
These access requests are executed by the data access unit
202.

If the write access request 204 is issued from the host
computer A200, the data access unit 202 performs data
conversion of the write data when it is necessary and writes
the converted data on the storage medium 203, and writes
unconverted write data on the storage medium 203 when the
data conversion is not necessary (208).

If the read access request 205 from the host computer
A200 is issued, the data access unit 202 reads data from the
storage medium 203, performs data conversion if it is
necessary and transfers the converted data to the host
computer A, and transfers the unconverted read data to the
host computer A when the data conversion is not necessary.

Processing for the access request from the host computer
B201 is conducted similarly to the processing for the access
request from the host computer A200.

The above-mentioned operation enables the host computers
having various different host computer input/output
interfaces to share the data on the storage medium.

It is to be noted that the number of the host computers
having different host computer input/output interfaces is not
limited to two; three or more host computers may be
connected.

FIG. 3 is a block diagram showing the configuration of a
disk subsystem having a cache memory in another
embodiment of the present invention.

In FIG. 3, a disk control 302 is connected to a host
computer 300 via a channel control 301 on the host side and
is also connected to a host computer 303 via a small
computer system interface (abbreviated as SCSI).

In the present embodiment, the host computer 300 is a
main frame computer (CKD data format) and the host
computer 303 is a UNIX computer (FBA data format).
The disk control 302 is connected to drives 315 and 316,
which are magnetic storage media, on the lower side.

The disk control 302 performs read/write of data on the
drives 315 and 316 in response to the requests of the host
computers 300 and 303.

Data transfer among the host computers 300, 303 and the
drives 315, 316 is controlled by the control processors 305,
306, 310 and 311 which are incorporated in the disk control
302.

The control processors 305 and 306 are connected to the
host computers 300 and 303 through the channel control 301
and the SCSI bus 304, respectively, and the control
processors 310 and 311 are connected to the drives 315 and
316 through the drive interfaces 313 and 314, respectively.
The control processors 305 and 306 mainly perform data
transfer between the host computers 300 and 303 and the
cache memory 309. The control processors 310 and 311
mainly perform data transfer between the cache memory 309
and the drives 315 and 316.

A common control memory 307 is a common memory
which is accessible from all control processors 305, 306, 310





5,920,893 


and 311 and stores therein common control information 318
used for allowing the disk control 302 to manage the drives
315 and 316. The common control information 318 will be
described hereafter in detail.

The cache memory 309 is accessible from all control
processors 305, 306, 310 and 311 and is used for temporarily
storing data which is read from the drives 315 and 316. A
cache slot 312 is the unit of data management in the
cache memory 309.

The control processors 305, 306, 310 and 311 receive/
transmit data and control signals from/to the cache memory
309 and the common control memory 307 via a signal line
308.

The control processors 305 and 306 are connected to a
service processor 317.

When update of the common control information 318 in
the common control memory 307 is instructed from the
service processor 317, the service processor 317 selects any
one of the control processors 305 and 306 to send an update
request, so that the selected control processor will update the
common control information 318 in the common control
memory 307.

Now, the common control information will be described.

The common control information 318 includes a drive
control block 400 and a host computer interface management
information table 500, which will be described in order.

FIG. 4 shows a drive control block (abbreviated as DCB)
400. One DCB 400 is provided for each one of the drives and
stores therein four items of data.

The four items of data include a drive number 401 for
enabling the disk control 302 to identify each drive,
interprocessor exclusive information 402, interhost exclusive
information 403 and drive vacancy waiting information 404.

The interprocessor exclusive information 402 is used
when the control processor 305 or 306 exclusively controls
the DCB access from the other control processor. The
processor number is set when the control processor reserves
the access right for the DCB having the specified drive
number and is canceled when the control processor releases
the access right to the DCB.

The interhost exclusive information 403 is used when the
host computer 300 or 303 exclusively controls the drive
access from the other host computer. The information 403 is
set "on" when the processor reserves access to the specified
drive number and is set "off" when the host computer
releases the access right.

The drive vacancy waiting information 404 is used to
inform one of the host computers 300 and 303 that the DCB
having the specified drive number has become vacant. It
applies when that host computer requested to reserve the
access right for the DCB but was informed that the DCB was
being used by the other host computer; the waiting host
computer is informed when the other host computer releases
the DCB.

FIG. 5 shows a host interface management information
table 500.

In the table 500, the host interface information 502 is
managed for each number 501 of the control processor to
which a host computer is connected. It is determined based
upon this information whether data conversion is to be
conducted.

In the present embodiment, the control processors 305
(503 in FIG. 5) and 306 (504 in FIG. 5) manage the CKD
and FBA formats 506 and 507, respectively. The present
management information table 500 is set and reset in
response to an instruction from the service processor 317.

Now, operation of the control processors 305, 306, 310
and 311 in the disk control 302 in accordance with the
present invention will be described.

FIG. 6 is a flow chart showing the main flow of operation
which is executed by a data access processing unit (600)
including a data format conversion unit.

When a control processor receives a data access
command from the host computer 300, DCB processing for
reserving the access right for the DCB of the specified drive
number is executed (step 601).

A determination as to whether reservation of the DCB
succeeded or not is made (step 602). If it failed, the data
access processing is ended (step 616). If it is successful, the
following operation will be conducted.

Firstly, reservation of the cache slot is conducted (step
603). Subsequently, a determination is made as to whether
the data access command is a write command or a read
command (step 604).

If it is a write command, operation will be conducted as
follows:

With reference to the host computer interface management
information table 500 (FIG. 5) (step 610), a determination
is made as to whether the host computer interface
which is connected to the control processor is in the CKD or
FBA format (step 611).

In the present embodiment, the control processors 305
and 306 are in the CKD and FBA formats 506 and 507,
respectively.

If it is in the CKD format, the control processor 305
converts CKD data into FBA data (step 612) and thereafter
writes the converted data into the cache slot 312 (step
613). If it is in the FBA format, the control processor 306
writes FBA data into the cache slot 312 without conducting
any conversion. Conversion of CKD data into FBA data will
be described hereafter. Thereafter, the cache slot is released
(step 614) and the DCB is released (step 615).

If the received command is a read command, operation
will be executed as follows:

Firstly, a determination is made as to whether the read data
exists in the cache memory (step 617). If the data exists,
operation at step 605 and the subsequent steps will be
conducted. If no data exists, operation (1) will be conducted.

In the operation (1), the control processors 305 and 306
instruct the control processors 310, 311 to read data from the
drives, and the control processors 310, 311 read data from the
drives via the drive interfaces 313, 314 and write the read
data in the cache slot of the cache memory. This operation is
omitted in the flow chart of FIG. 6.

Now, operation at step 605 and the subsequent steps will
be described.

Data on the cache slot 312 is read (step 605).

With reference to the host computer interface management
information table 500 (FIG. 5) (step 606), a determination
is made as to whether the host computer interface
which is connected to the control processor is in the CKD or
FBA format (step 607).

If it is in the CKD format, the control processor 305
converts the read data from FBA format into CKD format
data (step 608) and transfers the converted data to the host
computer 300 (step 609). Conversion of FBA data into CKD
data will be described later.

If it is in the FBA format, the data which is read out from
the cache slot 312 is transferred to the host computer 303
without conducting any conversion.
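
The data access flow of FIG. 6 can be summarized in a short Python sketch. It is a simplified illustration under stated assumptions, not the patented implementation: the names DriveControlBlock, convert and data_access are hypothetical, the cache is modelled as a dictionary keyed by drive number, the read path assumes the data is already in the cache, and the conversion is a placeholder that only relabels a format tag. The handling of the interprocessor exclusive information is also reduced to setting and clearing an owner field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DriveControlBlock:
    drive_number: int                      # 401
    owner_processor: Optional[int] = None  # 402: interprocessor exclusive information
    in_use_by_host: bool = False           # 403: interhost exclusive information
    waiting_host: Optional[int] = None     # 404: drive vacancy waiting information


def convert(data, target_format):
    """Placeholder conversion: only relabels the format tag."""
    return (target_format, data[1])


def data_access(dcb, processor_no, host_format, command, cache, payload=None):
    # Steps 601-602: reserve the DCB; end the access if the drive is taken.
    if dcb.in_use_by_host:
        return False, None                   # step 616
    dcb.owner_processor = processor_no
    dcb.in_use_by_host = True
    try:
        if command == "write":               # step 604
            data = (host_format, payload)
            if host_format == "CKD":         # steps 610-611
                data = convert(data, "FBA")  # step 612
            cache[dcb.drive_number] = data   # step 613
            return True, None
        data = cache[dcb.drive_number]       # step 605, assuming a cache hit
        if host_format == "CKD":             # steps 606-607
            data = convert(data, "CKD")      # step 608
        return True, data                    # step 609
    finally:
        # Steps 614-615: release the cache slot and the DCB.
        dcb.in_use_by_host = False
        dcb.owner_processor = None
```

The try/finally structure is a design choice of the sketch: it guarantees that the DCB is released on both the write and the read path, mirroring steps 614 and 615.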
Subsequently, the cache slot 312 is released (step 614) and
the DCB is released (step 615).

The write data on the cache slot 312 is written into the
drives 315, 316 by the control processors 310, 311
asynchronously with the operation of the host computers 300,
303. In other words, the control processors 310, 311 search
the cache memory and, when data to be written into a drive
is found in the cache memory, write the data into the
drive.

FIG. 7 is a flow chart showing the flow (700) of operation
in the DCB reservation operation (601) in FIG. 6.

Interprocessor exclusive information 402 of the DCB 400
corresponding to the specified drive number 401 is set
(step 701).

Subsequently, a determination is made as to whether the
interhost exclusive information is "on" or not (step 702).

If it is not "on", the interhost exclusive information 403
is set "on" (step 703) and success of reservation is set in a
return code (step 704).

Then, the interprocessor exclusive information 402 is
canceled (step 708) to end the DCB reservation operation
(step 709).

If the interhost exclusive information is already
"on", the operation will be conducted as follows:

The fact that the DCB is being used is reported to the host
computer (step 705).

Subsequently, the host computer to which the use of the DCB
is reported is recorded in the DCB vacancy waiting
information 404 of the DCB 400 (step 706).

Failure of reservation is set in the return code (step 707)
and then the interprocessor exclusive information 402 is
canceled (step 708).

FIG. 8 is a flow chart showing the flow of the DCB release
processing (615) in FIG. 6.

First, the interprocessor exclusive information 402 of the
DCB 400 of the drive number 401 for which release is
requested is set (step 801).

Then, the interhost exclusive information 403 is canceled
(step 802).

Then, a determination is made as to whether a host
computer exists which is registered in the vacancy waiting
information of the DCB (step 803).

If no host computer exists, the interprocessor exclusive
information 402 is canceled (step 805) to end (step 806) the
drive release operation (step 800).

If a host computer exists, vacancy of the DCB is reported to
the registered host computer (step 804).

This prevents the DCB from being predominantly used by
either one of the host computers. Then, the interprocessor
exclusive information is canceled (step 805).

FIG. 9 is a chart for explaining the conversion between
CKD and FBA format data.

Referring now to FIG. 9, conversion from CKD format
data to FBA format data will be briefly described.

Information on the blocks in which the CKD data (1) and (2)
are positioned is available. Information on the data length of
the CKD data is stored in a C (count) area. In the case of data
input/output to/from a host having CKD format, the data
length information is divided by 512 bytes to determine the
number of necessary blocks. The data is front-packed from
the front end of the positioned blocks. The remaining portion
of the block is left as it is. In the case of data input/output
to/from a host computer having FBA format, the specified
LBA (Logical Block Address) is accessed from the host
computer.

Conversion from FBA format data to CKD format data
will now be described.

In the case of data input/output to/from a host computer
having CKD format, a block having the C information of
interest is searched for and accessed based on the specified C
information.

In the case of data input/output to/from a host computer
having FBA format, a block of the specified LBA is accessed.

Since the method of conversion is well known, its detailed
description will be omitted herein.

Although a magnetic disk device is used as a storage
medium in the above-mentioned embodiments, the
above-mentioned data access processing can be implemented
by using a magnetic tape device or floppy disk device in lieu
of the magnetic disk device.

The host computers may be added to or removed from the
storage control or the computer system for each of the control
processors for processing a request of input/output of the
host computers or a request of input/output to/from the
storage media. Further, host computers
may be added to or removed from the storage control of a
drive (storage) having a device input/output interface. Such
an embodiment will be described with reference to FIG. 10.
Since components like those in FIG. 3 are designated by
like reference numerals, description of them will be omitted.

In FIG. 10, a pair (319) of a fiber channel control 317 and
a control processor 318 for controlling the same is
connected (added) to or removed (deleted) from a common
bus 308 of the disk control 302.

Further, a pair 322 of a floppy disk (FD) 321 and a
control processor 320 for controlling the same is connected
(added) to or removed from the common bus 308 of the disk
control 302.

The service processor 317 updates the contents of the
drive control block 318 in a common memory 307 and the
host computer interface management table 319 in response
to addition or deletion of the pairs 319 and 322.

In accordance with the present invention, data on storage
media can be shared by host computers having various
different host computer input/output interfaces, as mentioned
in the foregoing embodiments. Extensibility of the file
subsystem and the responsiveness of data are enhanced.

Since host computers having various different host
computer input/output interfaces or various storage media for
storing input/output data of the host computers can be
connected by a single storage control, a wide range of
configurations of computer systems are possible.

What is claimed is:

1. A storage control for use in a computer system
including a plurality of host computers having different kinds
of computer input/output interfaces; a storage control for
controlling data input/output to/from said host computers;
and at least one storage medium having a device input/output
interface for conducting data input/output in a given data
format for storing input/output data to/from said host
computers,

said storage control comprising:

a plurality of control processors for controlling the
transfer of input/output data to/from the host
computers, each of said processors being connected
to one of said plurality of host computers; and

a management table for managing the data format of
the input/output interface of each of said host
computers;

each of said control processors including a data format
converting unit which, in response to a request of
read or write of the associated host computer, converts
the format of the read or write data into said given
data format when the data format from the associated
host computer does not match said given data format
and does not convert the format of the read or write
data when they match.

2. A storage control as defined in claim 1 in which the
content in said management table is set/canceled in response
to another processor to allow a desired number of host
computers and associated control processors to be added or
removed.

3. A storage control as defined in claim 2 and further
including a drive control block for managing the state of
access to said storage media by said host computers.

4. A storage control as defined in claim 3 in which the
number of said storage media is arbitrarily changed to allow
a desired number of storage media to be added or removed.

5. A storage control as defined in claim 1 and further
including a cache memory which is connected to said
storage media so that input/output of data is conducted by
said host computers via the cache memory.

6. A computer system comprising:

a plurality of host computers having different kinds of
computer input/output interfaces; a storage control for
controlling data input/output to/from said host
computers; and

at least one storage medium having a device input/output
interface for conducting data input/output in a given
data format for storing input/output data to/from said
host computers,

said storage control including:

a plurality of control processors for controlling the
transfer of input/output data to/from the host
computers, each of said processors being connected
to one of said plurality of host computers; and

a management table for managing the data format of
the input/output interface of each of said host
computers;

each of said control processors including a data format
converting unit which, in response to a request of
read or write from the associated host computer,
converts the format of the read or write data into said
given data format when the data format of the
associated host computer does not match said given
data format and does not convert the format of the
read or write data when they match.

7. A computer system as defined in claim 6 in which the

9. A computer system as defined in claim 8 in which the
number of said storage media is arbitrarily changed to allow
a desired number of storage media to be added or removed.

10. A computer system as defined in claim 6 and further
including a cache memory which is connected to said
storage media so that input/output of data is conducted by
said host computers via the cache memory.

11. A storage control apparatus for controlling transfer of
data between a plurality of host computers and at least one
storage medium which stores data of a particular data format,
said particular data format being the same as a data format
used by at least one of said host computers, said storage
control apparatus comprising:

a plurality of control processors each controlling transfer
of data between one of the host computers and the
storage medium,

wherein each control processor includes a data format
converting unit which, in response to a request to write
data in the storage medium from a corresponding host
computer, converts the data format of the write data
into the particular data format when the data format of
the write data does not match the particular data format
and does not convert the data format of the write data
into the particular data format when the data format of
the write data matches the particular data format.

12. A storage control apparatus according to claim 11
wherein each data format converting unit of each control
processor, in response to a request to read data in the storage
medium from a corresponding host computer, converts the
data format of the read data into a data format other than the
particular data format used by the corresponding host
computer when the data format used by the corresponding host
computer does not match the particular data format and does
not convert the data format of the read data into the
particular data format when the data format used by the
corresponding host computer matches the particular data
format.

13. A storage control apparatus according to claim 11
further comprising:

a management table for managing information indicating
the data format used by each of the host computers,

wherein a data format of a host computer is determined
by referring to said information.

14. A storage control apparatus according to claim 13,
said information stored in said management table is set/
cancelled in response to addition or removal of a host
computer and a corresponding control processor.

15. A storage control apparatus according to claim 14,
further comprising:

a control block for managing the state of access to
storage media by said host computers.

16. A storage control apparatus according to claim 15,
wherein the number of said storage media is arbitrarily
content in said management table is set/canceled in response changed to add or remove storage media as desired. 

to another processor to allow a desired number of host n A . . i a- . i - n t 

computers and associated control processors to be added or 17 A S, ° rage COn,ro1 aCCOrd,D S 10 cIa,m U - further 

removed. composing: 

8. A computer system as defined in claim 7 and further 65 a cache memory which is connected to said storage media 
including a drive control block for managing tbe state of so that input/output of data is conducted by said bost 
access to said storage media by said host computers. computers via the cache memory. 
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18. A method of controlling transfer of data between a 
plurality of host computers and at least one storage medium 
which stores data of a particular data format, said particular 
data format being the same as a data format used by at least 
one of said host computers, said method comprising the 
steps of: 

controlling transfer of data between one of the host 
computers and the storage medium; and 

in response to a request to write data in the storage 
medium from a corresponding host computer, convert- 
ing the data format of the write data into the particular 
data formal when the data format of the write data does 
not match the particular data formal and not converting 
the data format of the write data into the particular data 
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format when the data format of the write data matches 
the particular data format. 
19. A method according to claim IS further comprising 
the step of: 

5 in response to a request to read data in the storage medium 
from a corresponding host computer, converting the 
data format of the read data into a data format other 
than the particular data format used by the correspond- 
ing host computer when the data format used by the 

10 host computer does not match the particular data format 
and not converting the data format of the read data into 
the particular data format when the data format used by 
the host computer matches the particular data format. 
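
The conditional conversion recited in the method claims can be illustrated with a short sketch. It is only an illustration under stated assumptions: the table contents, the host names and the `convert` helper are hypothetical, and the helper merely re-tags a record rather than performing a real CKD/FBA conversion.

```python
# Hypothetical sketch of the claimed write/read handling; not the patented implementation.
STORAGE_FORMAT = "FBA"  # the "particular data format" held on the storage medium

# Management table: data format used by each host (contents are assumed).
management_table = {"mainframe": "CKD", "unix_server": "FBA"}


def convert(record, target_format):
    """Placeholder conversion: a real unit would repack the payload itself."""
    return {"format": target_format, "payload": record["payload"]}


def handle_write(host, payload, medium):
    """Convert write data only when the host format differs from the storage format."""
    record = {"format": management_table[host], "payload": payload}
    if record["format"] != STORAGE_FORMAT:
        record = convert(record, STORAGE_FORMAT)
    medium.append(record)


def handle_read(host, medium, index):
    """Convert read data only when the host format differs from the storage format."""
    record = medium[index]
    host_format = management_table[host]
    if host_format != STORAGE_FORMAT:
        record = convert(record, host_format)
    return record
```

In this sketch a record written by either host is stored once in the storage format and can be read back by both hosts, each in its own format.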

*****

DATA STORAGE SYSTEM HAVING
SEPARATE DATA TRANSFER SECTION AND 
MESSAGE NETWORK WITH BUS 
ARBITRATION 

BACKGROUND OF THE INVENTION 

This invention relates generally to data storage systems, 
and more particularly to data storage systems having redun- 
dancy arrangements to protect against total system failure in 
the event of a failure in a component or subassembly of the 
storage system. 

As is known in the art, large host computers and servers 
(collectively referred to herein as "host computer/servers") 
require large capacity data storage systems. These large 
computer/servers generally include data processors, which
perform many operations on data introduced to the host 
computer/server through peripherals including the data stor- 
age system. The results of these operations are output to 
peripherals, including the storage system. 

One type of data storage system is a magnetic disk storage 
system. Here a bank of disk drives and the host computer/ 
server are coupled together through an interface. The inter- 
face includes "front end" or host computer/server controllers 
(or directors) and "back-end" or disk controllers (or 
directors). The interface operates the controllers (or 
directors) in such a way that they are transparent to the host 
computer/server. That is, data is stored in, and retrieved 
from, the bank of disk drives in such a way that the host 
computer/server merely thinks it is operating with its own 
local disk drive. One such system is described in U.S. Pat. 
No. 5,206,939, entitled "System and Method for Disk Mapping
and Data Retrieval", inventors Moshe Yanai, Natan
Vishlitzky, Bruno Alterescu and Daniel Castel, issued Apr. 
27, 1993, and assigned to the same assignee as the present 
invention. 

As described in such U.S. Patent, the interface may also 
include, in addition to the host computer/server controllers 
(or directors) and disk controllers (or directors), addressable 
cache memories. The cache memory is a semiconductor 
memory and is provided to rapidly store data from the host 
computer/server before storage in the disk drives, and, on 
the other hand, store data from the disk drives prior to being 
sent to the host computer/server. The cache memory being a 
semiconductor memory, as distinguished from a magnetic 
memory as in the case of the disk drives, is much faster than 
the disk drives in reading and writing data. 

The host computer/server controllers, disk controllers and 
cache memory are interconnected through a backplane 
printed circuit board. More particularly, disk controllers are 
mounted on disk controller printed circuit boards. The host 
computer/server controllers are mounted on host computer/ 
server controller printed circuit boards. And, cache memo- 
ries are mounted on cache memory printed circuit boards. 
The disk directors, host computer/server directors, and cache 
memory printed circuit boards plug into the backplane 
printed circuit board. In order to provide data integrity in 
case of a failure in a director, the backplane printed circuit 
board has a pair of buses. One set of the disk directors is
connected to one bus and another set of the disk directors is
connected to the other bus. Likewise, one set of the host
computer/server directors is connected to one bus and
another set of the host computer/server directors is
connected to the other bus. The cache memories are connected
to both buses. Each one of the buses provides data,
address and control information.

The arrangement is shown schematically in FIG. 1. Thus,
the use of two buses B1, B2 provides a degree of redundancy
to protect against a total system failure in the event that the
controllers or disk drives connected to one bus fail. Further,
the use of two buses increases the data transfer bandwidth of
the system compared to a system having a single bus. Thus,
in operation, when the host computer/server 12 wishes to
store data, the host computer 12 issues a write request to one
of the front-end directors 14 (i.e., host computer/server
directors) to perform a write command. One of the front-end
directors 14 replies to the request and asks the host computer
12 for the data. After the request has passed to the requesting
one of the front-end directors 14, the director 14 determines
the size of the data and reserves space in the cache memory
18 to store the request. The front-end director 14 then
produces control signals on one of the address memory
busses B1, B2 connected to such front-end director 14 to
enable the transfer to the cache memory 18. The host
computer/server 12 then transfers the data to the front-end
director 14. The front-end director 14 then advises the host
computer/server 12 that the transfer is complete. The front-end
director 14 looks up in a Table, not shown, stored in the
cache memory 18 to determine which one of the back-end
directors 20 (i.e., disk directors) is to handle this request.
The Table maps the host computer/server 12 addresses into
an address in the bank 22 of disk drives. The front-end
director 14 then puts a notification in a "mail box" (not
shown and stored in the cache memory 18) for the back-end
director 20 which is to handle the request, indicating the
amount of the data and the disk address for the data. Other
back-end directors 20 poll the cache memory 18 when they are
idle to check their "mail boxes". If the polled "mail box"
indicates a transfer is to be made, the back-end director 20
processes the request, addresses the disk drive in the bank 22,
reads the data from the cache memory 18 and writes it into the
addresses of a disk drive in the bank 22.
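
The prior-art write path described above, in which the "mail boxes" live in the cache memory itself, can be sketched as follows. This is a minimal model under assumptions: the class and function names, the shape of the notification and the access counter are hypothetical and serve only to show that mailbox traffic and data traffic compete for the same cache memory.

```python
from collections import deque


class CacheMemory:
    """Shared cache that, in the prior-art arrangement, also stores the mail boxes."""

    def __init__(self):
        self.slots = {}      # cache address -> data
        self.mailboxes = {}  # back-end director id -> queue of notifications
        self.accesses = 0    # data AND mailbox operations all count against the cache

    def store(self, addr, data):
        self.accesses += 1
        self.slots[addr] = data

    def load(self, addr):
        self.accesses += 1
        return self.slots[addr]

    def post(self, director_id, note):
        self.accesses += 1
        self.mailboxes.setdefault(director_id, deque()).append(note)

    def poll(self, director_id):
        self.accesses += 1
        box = self.mailboxes.get(director_id)
        return box.popleft() if box else None


def front_end_write(cache, table, host_addr, data):
    """Front-end director: store data in cache, then notify the mapped back-end director."""
    backend_id, disk_addr = table[host_addr]
    cache.store(host_addr, data)
    cache.post(backend_id, {"cache_addr": host_addr,
                            "disk_addr": disk_addr,
                            "size": len(data)})


def back_end_poll(cache, backend_id, disk):
    """Back-end director: poll its mail box and destage the data if a note is present."""
    note = cache.poll(backend_id)
    if note is None:
        return False
    disk[note["disk_addr"]] = cache.load(note["cache_addr"])
    return True
```

Even an idle poll that finds nothing increments `accesses`, which is the bandwidth cost that the later sections attribute to mailboxes and polling.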

When data is to be read from a disk drive in bank 22 to
the host computer/server 12, the system operates in a reciprocal
manner. More particularly, during a read operation, a
read request is instituted by the host computer/server 12 for
data at specified memory locations (i.e., a requested data
block). One of the front-end directors 14 receives the read
request and examines the cache memory 18 to determine
whether the requested data block is stored in the cache
memory 18. If the requested data block is in the cache
memory 18, the requested data block is read from the cache
memory 18 and is sent to the host computer/server 12. If the
front-end director 14 determines that the requested data
block is not in the cache memory 18 (i.e., a so-called "cache
miss"), the director 14 writes a note in the cache memory
18 (i.e., the "mail box") that it needs to receive the requested
data block. The back-end directors 20 poll the cache
memory 18 to determine whether there is an action to be
taken (i.e., a read operation of the requested block of data).
The one of the back-end directors 20 which polls the cache
memory 18 mail box and detects a read operation reads the
requested data block and initiates storage of such requested
data block in the cache memory 18. When the storage
is completely written into the cache memory 18, a read
complete indication is placed in the "mail box" in the cache
memory 18. It is to be noted that the front-end directors 14
are polling the cache memory 18 for read complete indications.
When one of the polling front-end directors 14 detects
a read complete indication, such front-end director 14
completes the transfer of the requested data which is now stored
in the cache memory 18 to the host computer/server 12.

The use of mailboxes and polling requires time to transfer
data between the host computer/server 12 and the bank 22 of
disk drives thus reducing the operating bandwidth of the
interface.

SUMMARY OF THE INVENTION

In accordance with the present invention, a system interface
is provided. Such interface includes a plurality of first
directors, a plurality of second directors, a data transfer
section and a message network. The data transfer section
includes a cache memory. The cache memory is coupled to
the plurality of first and second directors. The messaging
network operates independently of the data transfer section
and such network is coupled to the plurality of first directors
and the plurality of second directors. The first and second
directors control data transfer between the first directors and
the second directors in response to message passing between
the first directors and the second directors through the
messaging network to facilitate data transfer between first
directors and the second directors. The data passes through
the cache memory in the data transfer section.

With such an arrangement, the cache memory in the data
transfer section is not burdened with the task of transferring
the director messaging but rather a messaging network is
provided, operative independent of the data transfer section,
for such messaging thereby increasing the operating bandwidth
of the system interface.

In one embodiment of the invention, in the system interface
each one of the first directors includes a data pipe coupled
between an input of such one of the first directors and the
cache memory and a controller for transferring the messages
between the message network and such one of the first
directors.

In one embodiment each one of the second directors
includes a data pipe coupled between an input of such one
of the second directors and the cache memory and a controller
for transferring the messages between the message
network and such one of the second directors.

In one embodiment the directors include: a data pipe
coupled between an input of such one of the first directors
and the cache memory; a microprocessor; and a controller
coupled to the microprocessor and the data pipe for controlling
the transfer of the messages between the message
network and such one of the first directors and for controlling
the data between the input of such one of the first
directors and the cache memory.

In accordance with another feature of the invention, a data
storage system is provided for transferring data between a
host computer/server and a bank of disk drives through a
system interface. The system interface includes a plurality of
first directors coupled to host computer/server, a plurality of
second directors coupled to the bank of disk drives, a data
transfer section, and a message network. The data transfer
section includes a cache memory. The cache memory is
coupled to the plurality of first and second directors. The
message network is operative independently of the data
transfer section and such network is coupled to the plurality
of first directors and the plurality of second directors. The
first and second directors control data transfer between the
host computer and the bank of disk drives in response to
messages passing between the first directors and the second
directors through the messaging network to facilitate the
data transfer between host computer/server and the bank of
disk drives with such data passing through the cache
memory in the data transfer section.

In accordance with yet another embodiment, a method is
provided for operating a data storage system adapted to
transfer data between a host computer/server and a bank of
disk drives. The method includes transferring messages
through a messaging network with the data being transferred
between the host computer/server and the bank of disk
drives through a cache memory, such message network
being independent of the cache memory.

In accordance with another embodiment, a method is
provided for operating a data storage system adapted to
transfer data between a host computer/server and a bank of
disk drives through a system interface. The interface
includes a plurality of first directors coupled to host
computer/server, a plurality of second directors coupled to
the bank of disk drives; and a data transfer section having a
cache memory, such cache memory being coupled to the
plurality of first and second directors. The method comprises
transferring the data between the host computer/server and
the bank of disk drives under control of the first and second
directors in response to messages passing between the first
directors and the second directors through a messaging
network to facilitate the data transfer between host
computer/server and the bank of disk drives with such data
passing through the cache memory in the data transfer
section, such message network being independent of the
cache memory.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other features of the invention will become
more readily apparent from the following detailed description
when read together with the accompanying drawings, in
which:

FIG. 1 is a block diagram of a data storage system
according to the PRIOR ART;

FIG. 2 is a block diagram of a data storage system
according to the invention;

FIG. 2A shows the fields of a descriptor used in the system
interface of the data storage system of FIG. 2;

FIG. 2B shows the fields used in a MAC packet used in the
system interface of the data storage system of FIG. 2;

FIG. 3 is a sketch of an electrical cabinet storing a system
interface used in the data storage system of FIG. 2;

FIG. 4 is a diagrammatical, isometric sketch showing
printed circuit boards providing the system interface of the
data storage system of FIG. 2;

FIG. 5 is a block diagram of the system interface used in
the data storage system of FIG. 2;

FIG. 6 is a block diagram showing the connections
between front-end and back-end directors to one of a pair of
message network boards used in the system interface of the
data storage system of FIG. 2;

FIG. 7 is a block diagram of an exemplary one of the
director boards used in the system interface of the data
storage system of FIG. 2;

FIG. 8 is a block diagram of the system interface used in
the data storage system of FIG. 2;

FIG. 8A is a diagram of an exemplary global cache
memory board used in the system interface of FIG. 8;

FIG. 8B is a diagram showing a pair of director boards
coupled between a pair of host processors and global cache
memory boards used in the system interface of FIG. 8;

FIG. 8C is a block diagram of an exemplary crossbar
switch used in the front-end and rear-end directors of the
system interface of FIG. 8;

FIG. 9 is a block diagram of a transmit Direct Memory
Access (DMA) used in the system interface of FIG. 8;

FIG. 10 is a block diagram of a receive DMA used in the
system interface of FIG. 8;

FIG. 11 shows the relationship between FIGS. 11A and
11B, such FIGS. 11A and 11B together showing a process
flow diagram of the send operation of a message network
used in the system interface of FIG. 8;

FIGS. 11C-11E are examples of digital words used by the
message network in the system interface of FIG. 8;

FIG. 11F shows bits in a mask used in such message
network;

FIG. 11G shows the result of the mask of FIG. 11F applied
to the digital word shown in FIG. 11E;

FIG. 12 shows the relationship between FIGS. 12A and
12B, such FIGS. 12A and 12B showing a process flow
diagram of the receive operation of a message network used
in the system interface of FIG. 8;

FIG. 13 shows the relationship between FIGS. 11A and
11B, such FIGS. 11A and 11B together showing a process
flow diagram of the acknowledgement operation of a message
network used in the system interface of FIG. 8;

FIGS. 14A and 14B show process flow diagrams of the
transmit DMA operation of the transmit DMA of FIG. 9; and

FIGS. 15A and 15B show process flow diagrams of the
receive DMA operation of the receive DMA of FIG. 10.

DETAILED DESCRIPTION

Referring now to FIG. 2, a data storage system 100 is
shown for transferring data between a host computer/server
120 and a bank of disk drives 140 through a system interface
160. The system interface 160 includes: a plurality of, here
32 front-end directors 180₁-180₃₂ coupled to the host
computer/server 120 via ports 123₁-123₃₂; a plurality of back-end
directors 200₁-200₃₂ coupled to the bank of disk drives 140
via ports 123₃₃-123₆₄; a data transfer section 240, having a
global cache memory 220, coupled to the plurality of
front-end directors 180₁-180₁₆ and the back-end directors
200₁-200₁₆; and a messaging network 260, operative
independently of the data transfer section 240, coupled to the
plurality of front-end directors 180₁-180₃₂ and the plurality
of back-end directors 200₁-200₃₂, as shown. The front-end
and back-end directors 180₁-180₃₂, 200₁-200₃₂ are
functionally similar and include a microprocessor (μP) 299 (i.e.,
a central processing unit (CPU) and RAM), a message
engine/CPU controller 314 and a data pipe 316 to be
described in detail in connection with FIGS. 5, 6 and 7.
Suffice it to say here, however, that the front-end and
back-end directors 180₁-180₃₂, 200₁-200₃₂ control data
transfer between the host computer/server 120 and the bank
of disk drives 140 in response to messages passing between
the directors 180₁-180₃₂, 200₁-200₃₂ through the messaging
network 260. The messages facilitate the data transfer
between host computer/server 120 and the bank of disk
drives 140 with such data passing through the global cache
memory 220 via the data transfer section 240. More
particularly, in the case of the front-end directors
180₁-180₃₂, the data passes between the host computer and
the global cache memory 220 through the data pipe 316 in
the front-end directors 180₁-180₃₂ and the messages pass
through the message engine/CPU controller 314 in such
front-end directors 180₁-180₃₂. In the case of the back-end
directors 200₁-200₃₂ the data passes between the back-end
directors 200₁-200₃₂ and the bank of disk drives 140 and the
global cache memory 220 through the data pipe 316 in the
back-end directors 200₁-200₃₂ and again the messages pass
through the message engine/CPU controller 314 in such
back-end director 200₁-200₃₂.

With such an arrangement, the cache memory 220 in the
data transfer section 240 is not burdened with the task of
transferring the director messaging. Rather the messaging
network 260 operates independent of the data transfer section
240 thereby increasing the operating bandwidth of the
system interface 160.

In operation, and considering first a read request by the
host computer/server 120 (i.e., the host computer/server 120
requests data from the bank of disk drives 140), the request
is passed from one of a plurality of, here 32, host computer
processors 121₁-121₃₂ in the host computer 120 to one or
more of the pair of the front-end directors 180₁-180₃₂
connected to such host computer processor 121₁-121₃₂. (It
is noted that in the host computer 120, each one of the host
computer processors 121₁-121₃₂ is coupled to here a pair
(but not limited to a pair) of the front-end directors
180₁-180₃₂ to provide redundancy in the event of a failure
in one of the front end-directors 180₁-180₃₂ coupled thereto.
Likewise, the bank of disk drives 140 has a plurality of, here
32, disk drives 141₁-141₃₂, each disk drive 141₁-141₃₂
being coupled to here a pair (but not limited to a pair) of the
back-end directors 200₁-200₃₂, to provide redundancy in the
event of a failure in one of the back-end directors
200₁-200₃₂ coupled thereto). Each front-end director
180₁-180₃₂ includes a microprocessor (μP) 299 (i.e., a
central processing unit (CPU) and RAM) and will be
described in detail in connection with FIGS. 5 and 7. Suffice
it to say here, however, that the microprocessor 299 makes
a request for the data from the global cache memory 220.
The global cache memory 220 has a resident cache management
table, not shown. Every director 180₁-180₃₂,
200₁-200₃₂ has access to the resident cache management
table and every time a front-end director 180₁-180₃₂
requests a data transfer, the front-end director 180₁-180₃₂
must query the global cache memory 220 to determine
whether the requested data is in the global cache memory
220. If the requested data is in the global cache memory 220
(i.e., a read "hit"), the front-end director 180₁-180₃₂, more
particularly the microprocessor 299 therein, mediates a
DMA (Direct Memory Access) operation for the global
cache memory 220 and the requested data is transferred to
the requesting host computer processor 121₁-121₃₂.

If, on the other hand, the front-end director 180₁-180₃₂
receiving the data request determines that the requested data
is not in the global cache memory 220 (i.e., a "miss") as a
result of a query of the cache management table in the global
cache memory 220, such front-end director 180₁-180₃₂
concludes that the requested data is in the bank of disk drives
140. Thus the front-end director 180₁-180₃₂ that received
the request for the data must make a request for the data from
one of the back-end directors 200₁-200₃₂ in order for such
back-end director 200₁-200₃₂ to request the data from the
bank of disk drives 140. The mapping of which back-end
directors 200₁-200₃₂ control which disk drives 141₁-141₃₂
in the bank of disk drives 140 is determined during a
power-up initialization phase. The map is stored in the
global cache memory 220. Thus, when the front-end director
180₁-180₃₂ makes a request for data from the global cache
memory 220 and determines that the requested data is not in
the global cache memory 220 (i.e., a "miss"), the front-end
director 180₁-180₃₂ is also advised by the map in the global
cache memory 220 of the back-end director 200₁-200₃₂
responsible for the requested data in the bank of disk drives
140. The requesting front-end director 180₁-180₃₂ then
must make a request for the data in the bank of disk drives
140 from the map designated back-end director 200₁-200₃₂.
This request between the front-end director 180₁-180₃₂ and
the appropriate one of the back-end directors 200₁-200₃₂ (as
determined by the map stored in the global cache memory
220) is by a message which passes from the front-end
director 180₁-180₃₂ through the message network 260 to the
appropriate back-end director 200₁-200₃₂. It is noted then
that the message does not pass through the global cache
memory 220 (i.e., does not pass through the data transfer
section 240) but rather passes through the separate,
independent message network 260. Thus, communication
between the directors 180₁-180₃₂, 200₁-200₃₂ is through
the message network 260 and not through the global cache
memory 220. Consequently, valuable bandwidth for the
global cache memory 220 is not used for messaging among
the directors 180₁-180₃₂, 200₁-200₃₂.

Thus, on a global cache memory 220 "read miss", the
front-end director 180₁-180₃₂ sends a message to the
appropriate one of the back-end directors 200₁-200₃₂ through the
message network 260 to instruct such back-end director
200₁-200₃₂ to transfer the requested data from the bank of
disk drives 140 to the global cache memory 220. When
accomplished, the back-end director 200₁-200₃₂ advises the
requesting front-end director 180₁-180₃₂ that the transfer is
accomplished by a message, which passes from the back-end
director 200₁-200₃₂ to the front-end director
180₁-180₃₂ through the message network 260. In response
to the acknowledgement signal, the front-end director
180₁-180₃₂ is thereby advised that such front-end director
180₁-180₃₂ can transfer the data from the global cache
memory 220 to the requesting host computer processor
121₁-121₃₂ as described above when there is a cache "read
hit".
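
The read hit and read miss handling described above can be sketched as follows. This is a simplified model under assumptions: the class names are hypothetical, the acknowledgement is modeled as the return value of a synchronous `send`, and the back-end map is a plain dictionary held by the cache object.

```python
class MessageNetwork:
    """Carries director-to-director messages; it never touches the cache data."""

    def __init__(self):
        self.handlers = {}
        self.messages_sent = 0

    def register(self, director_id, handler):
        self.handlers[director_id] = handler

    def send(self, dest, message):
        self.messages_sent += 1
        return self.handlers[dest](message)


class GlobalCache:
    def __init__(self, backend_map):
        self.data = {}                  # block -> cached data
        self.backend_map = backend_map  # block -> responsible back-end director


class BackEndDirector:
    def __init__(self, ident, cache, disks, network):
        self.cache, self.disks = cache, disks
        network.register(ident, self.on_message)

    def on_message(self, message):
        block = message["block"]
        self.cache.data[block] = self.disks[block]  # stage from disk into the cache
        return "ack"                                # acknowledgement to the requester


class FrontEndDirector:
    def __init__(self, cache, network):
        self.cache, self.network = cache, network

    def read(self, block):
        if block not in self.cache.data:            # read "miss"
            backend = self.cache.backend_map[block]
            reply = self.network.send(backend, {"op": "stage", "block": block})
            if reply != "ack":
                raise RuntimeError("staging was not acknowledged")
        return self.cache.data[block]               # read "hit" path
```

A second read of the same block is served from the cache without any further message, so in this model the message count grows only on misses.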

It should be noted that there might be one or more
back-end directors 200₁-200₃₂ responsible for the requested
data. Thus, if only one back-end director 200₁-200₃₂ is
responsible for the requested data, the requesting front-end
director 180₁-180₃₂ sends a uni-cast message via the message
network 260 to only that specific one of the back-end
directors 200₁-200₃₂. On the other hand, if more than one of
the back-end directors 200₁-200₃₂ is responsible for the
requested data, a multi-cast message (here implemented as
a series of uni-cast messages) is sent by the requesting one
of the front-end directors 180₁-180₃₂ to all of the back-end
directors 200₁-200₃₂ having responsibility for the requested
data. In any event, with either a uni-cast or multi-cast
message, such message is passed through the message
network 260 and not through the data transfer section 240
(i.e., not through the global cache memory 220).
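
The choice between a uni-cast message and a multi-cast message implemented as a series of uni-cast messages can be sketched as follows. The function names, the message payload and the list used to stand in for the message network are assumptions made for illustration only.

```python
def unicast(network_log, source, dest, payload):
    """Send one message to one destination over the (modeled) message network."""
    network_log.append((source, dest, payload))


def multicast(network_log, source, dests, payload):
    """Multi-cast implemented as a series of uni-cast messages."""
    for dest in dests:
        unicast(network_log, source, dest, payload)


def request_staging(network_log, source, responsible, block):
    """Notify every back-end director responsible for the requested block."""
    payload = {"op": "stage", "block": block}
    if len(responsible) == 1:
        unicast(network_log, source, responsible[0], payload)
    else:
        multicast(network_log, source, responsible, payload)
```

In both cases every message appears in `network_log`, the stand-in for the message network, and nothing is written to a cache object.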

Likewise, it should be noted that while one of the host
computer processors 121₁-121₃₂ might request data, the
acknowledgement signal may be sent to the requesting host
computer processor 121₁ or one or more other host computer
processors 121₁-121₃₂ via a multi-cast (i.e., sequence of
uni-cast messages) through the message network 260 to
complete the data read operation.

Considering a write operation, the host computer 120
wishes to write data into storage (i.e., into the bank of disk
drives 140). One of the front-end directors 180₁-180₃₂
receives the data from the host computer 120 and writes it
into the global cache memory 220. The front-end director
180₁-180₃₂ then requests the transfer of such data after
some period of time when the back-end director 200₁-200₃₂
determines that the data can be removed from such cache
memory 220 and stored in the bank of disk drives 140.
Before the transfer to the bank of disk drives 140, the data
in the cache memory 220 is tagged with a bit as "fresh data"
(i.e., data which has not been transferred to the bank of disk
drives 140, that is, data which is "write pending"). Thus, if
there are multiple write requests for the same memory
location in the global cache memory 220 (e.g., a particular
bank account) before being transferred to the bank of disk
drives 140, the data is overwritten in the cache memory 220
with the most recent data. Each time data is transferred to the
global cache memory 220, the front-end director 180₁-180₃₂
controlling the transfer also informs the host computer 120
that the transfer is complete to thereby free-up the host
computer 120 for other data transfers.

When it is time to transfer the data in the global cache
memory 220 to the bank of disk drives 140, as determined
by the back-end director 200₁-200₃₂, the back-end director
200₁-200₃₂ transfers the data from the global cache memory
220 to the bank of disk drives 140 and resets the tag
associated with data in the global cache memory 220 (i.e.,
un-tags the data) to indicate that the data in the global cache
memory 220 has been transferred to the bank of disk drives
140. It is noted that the un-tagged data in the global cache
memory 220 remains there until overwritten with new data.
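
The "write pending" tagging and later de-staging described above can be sketched as follows. The slot structure, function names and the dictionary standing in for the bank of disk drives are hypothetical; the sketch only shows that repeated writes to one location overwrite the cached data, and that de-staging clears the tag while leaving the data in the cache.

```python
class CacheSlot:
    def __init__(self):
        self.data = None
        self.write_pending = False  # the "fresh data" tag bit


def host_write(cache, addr, data):
    """Front-end path: overwrite the slot, tag it, and report completion to the host."""
    slot = cache.setdefault(addr, CacheSlot())
    slot.data = data
    slot.write_pending = True
    return "complete"


def destage(cache, disk):
    """Back-end path: copy tagged slots to disk and reset (un-tag) them."""
    written = 0
    for addr, slot in cache.items():
        if slot.write_pending:
            disk[addr] = slot.data
            slot.write_pending = False  # data stays cached until overwritten
            written += 1
    return written
```

Two writes to the same address followed by one `destage` call result in a single disk write carrying the most recent data.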

Referring now to FIGS. 3 and 4, the system interface 160
is shown to include an electrical cabinet 300 having stored
therein: a plurality of, here eight, front-end director boards
190₁-190₈, each one having here four of the front-end
directors 180₁-180₃₂; a plurality of, here eight, back-end
director boards 210₁-210₈, each one having here four of the
back-end directors 200₁-200₃₂; and a plurality of, here
eight, memory boards 220' which together make up the
global cache memory 220. These boards plug into the front
side of a backplane 302. (It is noted that the backplane 302
is a mid-plane printed circuit board). Plugged into the
backside of the backplane 302 are message network boards
304₁, 304₂. The backside of the backplane 302 has plugged
into it adapter boards, not shown in FIGS. 2-4, which couple
the boards plugged into the back-side of the backplane 302
with the computer 120 and the bank of disk drives 140 as
shown in FIG. 2. That is, referring again briefly to FIG. 2,
an I/O adapter, not shown, is coupled between each one of
the front-end directors 180₁-180₃₂ and the host computer
120 and an I/O adapter, not shown, is coupled between each
one of the back-end directors 200₁-200₃₂ and the bank of
disk drives 140.

Referring now to FIG. 5, the system interface 160 is 
shown to include the director boards 190₁-190₈, 210₁-210₈
and the global cache memory 220, plugged into the
backplane 302 and the disk drives 141₁-141₃₂ in the bank of
disk drives along with the host computer 120 also plugged
into the backplane 302 via I/O adapter boards, not shown.
The message network 260 (FIG. 2) includes the message
network boards 304₁ and 304₂. Each one of the message
network boards 304₁ and 304₂ is identical in construction. A
pair of message network boards 304₁ and 304₂ is used for
redundancy and for message load balancing. Thus, each
message network board 304₁, 304₂ includes a controller 306
(i.e., an initialization and diagnostic processor comprising a
CPU, system controller interface and memory, as shown in
FIG. 6 for one of the message network boards 304₁, 304₂,
here board 304₁) and a crossbar switch section 308 (e.g., a
switching fabric made up of here four switches 308₁-308₄).

Referring again to FIG. 5, each one of the director boards 
190₁-210₈ includes, as noted above, four of the directors
180₁-180₃₂, 200₁-200₃₂ (FIG. 2). It is noted that the director
boards 190₁-190₈ having four front-end directors per board,
180₁-180₃₂, are referred to as front-end directors and the
director boards 210₁-210₈ having four back-end directors
per board, 200₁-200₃₂, are referred to as back-end directors.
Each one of the directors 180₁-180₃₂, 200₁-200₃₂ includes
a CPU 310, a RAM 312 (which make up the microprocessor 
299 referred to above), the message engine/CPU controller 
314, and the data pipe 316. 

Each one of the director boards 190₁-210₈ includes a
crossbar switch 318. The crossbar switch 318 has four
input/output ports 319, each one being coupled to the data
pipe 316 of a corresponding one of the four directors
180₁-180₃₂, 200₁-200₃₂ on the director board 190₁-210₈.
The crossbar switch 318 has eight output/input ports
collectively identified in FIG. 5 by numerical designation 321
(which plug into the backplane 302). The crossbar switch 318
on the front-end director boards 190₁-190₈ is used for
coupling the data pipe 316 of a selected one of the four
front-end directors 180₁-180₃₂ on the front-end director
board 190₁-190₈ to the global cache memory 220 via the
backplane 302 and I/O adapter, not shown. The crossbar
switch 318 on the back-end director boards 210₁-210₈ is
used for coupling the data pipe 316 of a selected one of the
four back-end directors 200₁-200₃₂ on the back-end director
board 210₁-210₈ to the global cache memory 220 via the
backplane 302 and I/O adapter, not shown. Thus, referring
to FIG. 2, the data pipe 316 in the front-end directors
180₁-180₃₂ couples data between the host computer 120 and
the global cache memory 220 while the data pipe 316 in the
back-end directors 200₁-200₃₂ couples data between the
bank of disk drives 140 and the global cache memory 220.
It is noted that there are separate point-to-point data paths
P₁-P₆₄ (FIG. 2) between each one of the directors
180₁-180₃₂, 200₁-200₃₂ and the global cache memory 220.
It is also noted that the backplane 302 is a passive backplane
because it is made up of only etched conductors on one or
more layers of a printed circuit board. That is, the backplane
302 does not have any active components.

Referring again to FIG. 5, each one of the director boards
190₁-210₈ includes a crossbar switch 320. Each crossbar
switch 320 has four input/output ports 323, each one of the
four input/output ports 323 being coupled to the message
engine/CPU controller 314 of a corresponding one of the
four directors 180₁-180₃₂, 200₁-200₃₂ on the director board
190₁-210₈. Each crossbar switch 320 has a pair of output/
input ports 325₁, 325₂, which plug into the backplane 302.
Each port 325₁-325₂ is coupled to a corresponding one of
the message network boards 304₁, 304₂, respectively,
through the backplane 302. The crossbar switch 320 on the
front-end director boards 190₁-190₈ is used to couple the
messages between the message engine/CPU controller 314
of a selected one of the four front-end directors 180₁-180₃₂
on the front-end director boards 190₁-190₈ and the message
network 260, FIG. 2. Likewise, the back-end director boards
210₁-210₈ are used to couple the messages produced by a
selected one of the four back-end directors 200₁-200₃₂ on
the back-end director board 210₁-210₈ between the message
engine/CPU controller 314 of a selected one of such four
back-end directors and the message network 260 (FIG. 2).
Thus, referring also to FIG. 2, instead of having a separate
dedicated message path between each one of the directors
180₁-180₃₂, 200₁-200₃₂ and the message network 260
(which would require M individual connections to the
backplane 302 for each of the directors, where M is an
integer), here only M/4 individual connections are required.
Thus, the total number of connections between the directors
180₁-180₃₂, 200₁-200₃₂ and the backplane 302 is reduced to
1/4th. Thus, it should be noted from FIGS. 2 and 5 that the
message network 260 (FIG. 2) includes the crossbar switch
320 and the message network boards 304₁, 304₂.

Each message is a 64-byte descriptor (shown in FIG. 2A)
which is created by the CPU 310 (FIG. 5) under software
control and is stored in a send queue in RAM 312. When the
message is to be read from the send queue in RAM 312 and
transmitted through the message network 260 (FIG. 2) to
one or more other directors via a DMA operation to be
described, it is packetized in the packetizer portion of
packetizer/de-packetizer 428 (FIG. 7) into a MAC type
packet, shown in FIG. 2B, here using the NGIO protocol
specification. There are three types of packets: a message
packet section; an acknowledgement packet; and a message
network fabric management packet, the latter being used to
establish the message network routing during initialization
(i.e., during power-up). Each one of the MAC packets has:
an 8-byte header which includes source (i.e., transmitting
director) and destination (i.e., receiving director) address; a
payload; and terminates with a 4-byte Cyclic Redundancy
Check (CRC), as shown in FIG. 2B. The acknowledgement
packet (i.e., signal) has a 4-byte acknowledgment payload
section. The message packet has a 32-byte payload section.
The Fabric Management Packet (FMP) has a 256-byte
payload section. The MAC packet is sent to the crossbar
switch 320. The destination portion of the packet is used to
indicate the destination for the message and is decoded by
the switch 320 to determine to which port the message is to
be routed. The decoding process uses a decoder table 327 in
the switch 320, such table being initialized during power-up
by the initialization and diagnostic processor (controller)
306 (FIG. 5). The table 327 (FIG. 7) provides the
relationship between the destination address portion of the
MAC packet, which identifies the routing for the message,
and the one of the four directors 180₁-180₃₂, 200₁-200₃₂ on
the director board 190₁-190₈, 210₁-210₈ or the one of the
message network boards 304₁, 304₂ to which the message is
to be directed.

More particularly, and referring to FIG. 5, a pair of
output/input ports 325₁, 325₂ is provided for each one of the
crossbar switches 320, each one being coupled to a
corresponding one of the pair of message network boards
304₁, 304₂. Thus, each one of the message network boards
304₁, 304₂ has sixteen input/output ports 322₁-322₁₆, each
one being coupled to a corresponding one of the output/input
ports 325₁, 325₂, respectively, of a corresponding one of the
director boards 190₁-190₈, 210₁-210₈ through the
backplane 302, as shown. Thus, considering exemplary
message network board 304₁, FIG. 6, each switch 308₁-308₄
also includes three coupling ports 324₁-324₃. The coupling
ports 324₁-324₃ are used to interconnect the switches
308₁-308₄, as shown in FIG. 6. Thus, considering message
network board 304₁, input/output ports 322₁-322₈ are
coupled to output/input ports 325₁ of front-end director
boards 190₁-190₈ and input/output ports 322₉-322₁₆ are
coupled to output/input ports 325₁ of back-end director
boards 210₁-210₈, as shown. Likewise, considering message
network board 304₂, input/output ports 322₁-322₈ thereof
are coupled, via the backplane 302, to output/input ports
325₂ of front-end director boards 190₁-190₈ and
input/output ports 322₉-322₁₆ are coupled, via the backplane
302, to output/input ports 325₂ of back-end director boards
210₁-210₈.

As noted above, each one of the message network boards
304₁, 304₂ includes a processor 306 (FIG. 5) and a crossbar
switch section 308 having four switches 308₁-308₄, as
shown in FIGS. 5 and 6. The switches 308₁-308₄ are
interconnected as shown so that messages can pass between
any pair of the input/output ports 322₁-322₁₆. Thus, it
follows that a message from any one of the front-end
directors 180₁-180₃₂ can be coupled to another one of the
front-end directors 180₁-180₃₂ and/or to any one of the
back-end directors 200₁-200₃₂. Likewise, a message from
any one of the back-end directors 200₁-200₃₂ can be
coupled to another one of the back-end directors 200₁-200₃₂
and/or to any one of the front-end directors 180₁-180₃₂.

As noted above, each MAC packet (FIG. 2B) includes an
address destination portion and a data payload portion.
The MAC header is used to indicate the destination for the
MAC packet and such MAC header is decoded by the switch
to determine to which port the MAC packet is to be routed.
The decoding process uses a table in the switch 308₁-308₄,
such table being initialized by processor 306 during power-up.
The table provides the relationship between the MAC
header, which identifies the destination for the MAC packet,
and the route to be taken through the message network.
Thus, after initialization, the switches 320 and the switches
308₁-308₄ in switch section 308 provide packet routing
which enables each one of the directors 180₁-180₃₂,
200₁-200₃₂ to transmit a message between itself and any
other one of the directors, regardless of whether such other
director is on the same director board 190₁-190₈, 210₁-210₈
or on a different director board. Further, the MAC packet has
an additional bit B in the header thereof, as shown in FIG.
2B, which enables the message to pass through message
network board 304₁ or through message network board
304₂. During normal operation, this additional bit B is
toggled between a logic 1 and a logic 0 so that one message
passes through one of the redundant message network
boards 304₁, 304₂ and the next message passes through the
other one of the message network boards 304₁, 304₂ to
balance the load requirement on the system. However, in the
event of a failure in one of the message network boards
304₁, 304₂, the non-failed one of the boards 304₁, 304₂ is
used exclusively until the failed message network board is
replaced.
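
The packet framing and the toggling of bit B described above can be sketched as follows. Only the sizes (8-byte header, 4-, 32- or 256-byte payload, 4-byte CRC) and the alternate-then-fail-over behaviour come from the text. The position of each field inside the header, the use of CRC-32 from `zlib`, and all function and class names are assumptions made for illustration.

```python
import struct
import zlib

# Payload sizes stated in the text for the three packet types.
PAYLOAD_BYTES = {"ack": 4, "message": 32, "fmp": 256}


def build_packet(kind, src, dst, payload, board_bit):
    """Frame a payload as header + payload + CRC (header layout is hypothetical)."""
    if len(payload) != PAYLOAD_BYTES[kind]:
        raise ValueError("wrong payload size for " + kind)
    # 8-byte header: source, destination, bit B, then five pad bytes (assumed).
    header = struct.pack(">BBB5x", src, dst, board_bit)
    crc = zlib.crc32(header + payload)  # CRC-32 is an assumed polynomial
    return header + payload + struct.pack(">I", crc)


def check_packet(packet):
    """Recompute the CRC over header and payload and compare with the trailer."""
    (crc,) = struct.unpack(">I", packet[-4:])
    return zlib.crc32(packet[:-4]) == crc


class BoardSelector:
    """Alternate bit B per message; use only the surviving board after a failure."""

    def __init__(self):
        self.next_bit = 0
        self.failed = set()

    def pick(self):
        bit = self.next_bit
        self.next_bit ^= 1
        if bit in self.failed:
            bit ^= 1  # route around the failed message network board
        return bit


selector = BoardSelector()
bits = [selector.pick() for _ in range(4)]
packet = build_packet("message", 1, 34, bytes(32), selector.pick())
selector.failed.add(1)
after_failure = [selector.pick() for _ in range(3)]
```

Under these assumptions a message packet occupies 8 + 32 + 4 = 44 bytes on the wire, and once board 1 is marked failed every subsequent message carries bit B = 0.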

Referring now to FIG. 7, an exemplary one of the director 
boards 190₁-190₈, 210₁-210₈, here director board 190₁, is
shown to include directors 180₁, 180₃, 180₅ and 180₇. An
exemplary one of the directors 180₁-180₄, here 180₁, is
shown in detail to include the data pipe 316, the message
engine/CPU controller 314, the RAM 312, and the CPU 310
all coupled to the CPU interface bus 317, as shown. The
exemplary director 180₁ also includes: a local cache
memory 319 (which is coupled to the CPU 310); the
crossbar switch 318; and, the crossbar switch 320, described
briefly above in connection with FIGS. 5 and 6. The data
pipe 316 includes a protocol translator 400, a quad port
RAM 402 and a quad port RAM controller 404 arranged as
shown. Briefly, the protocol translator 400 converts between
the protocol of the host computer 120, in the case of a
front-end director 180₁-180₃₂, (and between the protocol
used by the disk drives in bank 140 in the case of a back-end
director 200₁-200₃₂) and the protocol between the directors
180₁-180₃₂, 200₁-200₃₂ and the global memory 220 (FIG.
2). More particularly, the protocol used by the host computer
120 may, for example, be fibre channel, SCSI, ESCON or
FICON, as determined by the manufacturer of
the host computer 120 while the protocol used internal to the
system interface 160 (FIG. 2) may be selected by the
manufacturer of the interface 160. The quad port RAM 402
is a FIFO controlled by controller 404 because the rate of
data coming into the RAM 402 may be different from the
rate of data leaving the RAM 402. The RAM 402 has four
ports, each adapted to handle an 18 bit digital word. Here, the
protocol translator 400 produces 36 bit digital words for the
system interface 160 (FIG. 2) protocol; one 18 bit portion of
the word is coupled to one of a pair of the ports of the quad
port RAM 402 and the other 18 bit portion of the word is
coupled to the other one of the pair of the ports of the quad
port RAM 402. The quad port RAM has a pair of ports
402A, 402B, each one of the ports 402A, 402B being adapted
to handle an 18 bit digital word. Each one of the ports 402A,
402B is independently controllable and has independent, but
arbitrated, access to the memory array within the RAM 402.
Data is transferred between the ports 402A, 402B and the
cache memory 220 (FIG. 2) through the crossbar switch 318,
as shown.

The crossbar switch 318 includes a pair of switches 406A,
406B. Each one of the switches 406A, 406B includes four
input/output director-side ports D₁-D₄ (collectively referred
to above in connection with FIG. 5 as port 319) and four
input/output memory-side ports M₁-M₄, M₅-M₈,
respectively, as indicated. (The input/output memory-side
ports M₁-M₄, M₅-M₈ were collectively referred to above in
connection with FIG. 5 as port 321). The director-side ports
D₁-D₄ of switch 406A are connected to the 402A ports of
the quad port RAMs 402 in each one of the directors 180₁,
180₃, 180₅ and 180₇, as indicated. Likewise, director-side
ports of switch 406B are connected to the 402B ports of the
quad port RAMs 402 in each one of the directors 180₁, 180₃,
180₅, and 180₇ as indicated. The ports D₁-D₄ are selectively
coupled to the ports M₁-M₄ in accordance with control
words provided to the switch 406A by the controllers in
directors 180₁, 180₃, 180₅, 180₇ on busses R_A1-R_A4,
respectively, and the ports D₁-D₄ are coupled to ports
M₅-M₈ in accordance with the control words provided to
switch 406B by the controllers in directors 180₁, 180₃, 180₅,
180₇ on busses R_B1-R_B4, as indicated. The signals on buses
R_A1-R_A4 are request signals. Thus, port 402A of any one of
the directors 180₁, 180₃, 180₅, 180₇ may be coupled to any
one of the ports M₁-M₄ of switch 406A, selectively in
accordance with the request signals on buses R_A1-R_A4.
Likewise, port 402B of any one of the directors 180₁, 180₃,
180₅, 180₇ may be coupled to any one of the ports M₅-M₈ of
switch 406B, selectively in accordance with the request
signals on buses R_B1-R_B4. The coupling between the
director boards 190₁-190₈, 210₁-210₈ and the global cache
memory 220 is
shown in FIG. 8. 
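
The two preceding paragraphs describe a 36 bit word being carried as two 18 bit portions on ports 402A and 402B, with port 402A reaching memory-side ports M₁-M₄ through switch 406A and port 402B reaching M₅-M₈ through switch 406B. A minimal sketch of that arithmetic follows. Which half of the word travels on which port, and the numbering of a request as 1 to 4, are assumptions; the function names are hypothetical.

```python
MASK_18 = (1 << 18) - 1  # an 18 bit portion


def split_word(word36):
    """Split a 36 bit word into two 18 bit portions, one per RAM port.

    Assumption: the upper half goes to port 402A and the lower half to 402B.
    """
    if not 0 <= word36 < (1 << 36):
        raise ValueError("expected a 36 bit value")
    return {"402A": (word36 >> 18) & MASK_18, "402B": word36 & MASK_18}


def join_word(halves):
    """Reassemble the 36 bit word from the two 18 bit portions."""
    return (halves["402A"] << 18) | halves["402B"]


def memory_port(ram_port, requested):
    """Return the memory-side port number selected by a request numbered 1 to 4.

    Switch 406A (fed by port 402A) serves M1-M4; switch 406B (fed by port
    402B) serves M5-M8.
    """
    if requested not in (1, 2, 3, 4):
        raise ValueError("request must select one of four ports")
    return requested if ram_port == "402A" else requested + 4


halves = split_word((1 << 18) | 5)
restored = join_word(halves)
```

Keeping the two halves on separate switches is what lets each port be controlled independently while still moving one logical word.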

More particularly, and referring also to FIG. 2, as noted 
above, each one of the host computer processors 121₁-121₃₂
in the host computer 120 is coupled to a pair of the front-end
directors 180₁-180₃₂ to provide redundancy in the event of
a failure in one of the front-end directors 180₁-180₃₂
coupled thereto. Likewise, the bank of disk drives 140 has
a plurality of, here 32, disk drives 141₁-141₃₂, each disk
drive 141₁-141₃₂ being coupled to a pair of the back-end
directors 200₁-200₃₂ to provide redundancy in the event of
a failure in one of the back-end directors 200₁-200₃₂
coupled thereto. Thus, considering exemplary host
computer processor 121₁, such processor 121₁ is coupled to a
pair of front-end directors 180₁, 180₂. Thus, if director 180₁
fails, the host computer processor 121₁ can still access the
system interface 160, albeit by the other front-end director
180₂. Thus, directors 180₁ and 180₂ are considered a
redundancy pair of directors. Likewise, other redundancy
pairs of front-end directors are: front-end directors 180₃,
180₄; 180₅, 180₆; 180₇, 180₈; 180₉, 180₁₀; 180₁₁, 180₁₂;
180₁₃, 180₁₄; 180₁₅, 180₁₆; 180₁₇, 180₁₈; 180₁₉, 180₂₀;
180₂₁, 180₂₂; 180₂₃, 180₂₄; 180₂₅, 180₂₆; 180₂₇, 180₂₈;
180₂₉, 180₃₀; and 180₃₁, 180₃₂ (only directors 180₃₁ and
180₃₂ being shown in FIG. 2).
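
The pairing rule listed above (each odd-numbered director is paired with the next even-numbered one) can be written compactly. This is only a restatement of the list in code; the function names are hypothetical.

```python
def redundancy_pairs(count=32):
    """Return the (odd, even) director pairs: (1, 2), (3, 4), ..., (31, 32)."""
    return [(n, n + 1) for n in range(1, count, 2)]


def partner(director):
    """Return the redundant partner of a director numbered from 1."""
    return director + 1 if director % 2 else director - 1


pairs = redundancy_pairs()
```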

Likewise, disk drive 141₁ is coupled to a pair of back-end
directors 200₁, 200₂. Thus, if director 200₁ fails, the disk
drive 141₁ can still access the system interface 160, albeit by
the other back-end director 200₂. Thus, directors 200₁ and
200₂ are considered a redundancy pair of directors. Likewise,
other redundancy pairs of back-end directors are: back-end
directors 200₃, 200₄; 200₅, 200₆; 200₇, 200₈; 200₉, 200₁₀;
200₁₁, 200₁₂; 200₁₃, 200₁₄; 200₁₅, 200₁₆; 200₁₇, 200₁₈;
200₁₉, 200₂₀; 200₂₁, 200₂₂; 200₂₃, 200₂₄; 200₂₅, 200₂₆;
200₂₇, 200₂₈; 200₂₉, 200₃₀; and 200₃₁, 200₃₂ (only directors
200₃₁ and 200₃₂ being shown in FIG. 2). Further, referring
also to FIG. 8, the global cache memory 220 includes a
plurality of, here eight, cache memory boards 220₁-220₈, as
shown. Still further, referring to FIG. 8A, an exemplary one
of the cache memory boards, here board 220₁, is shown in
detail and is described in detail in U.S. Pat. No. 5,943,287
entitled "Fault Tolerant Memory System", John K. Walton,
inventor, issued Aug. 24, 1999 and assigned to the same
assignee as the present invention, the entire subject matter
therein being incorporated herein by reference. Thus, as
shown in FIG. 8A, the board 220₁ includes a plurality of,
here four, RAM memory arrays, each one of the arrays having
a pair of redundant ports, i.e., an A port and a B port. The
board itself has sixteen ports: a set of eight A ports
M_A1-M_A8 and a set of eight B ports M_B1-M_B8. Four of
the eight A ports, here A ports M_A1-M_A4, are coupled to
the M₁ port of each of the front-end director boards 190₁,
190₃, 190₅, and 190₇, respectively, as indicated in FIG. 8.
Four of the eight B ports, here B ports M_B1-M_B4, are
coupled to the M₁ port of each of the front-end director
boards 190₂, 190₄, 190₆, and 190₈, respectively, as indicated
in FIG. 8. The other four of the eight A ports, here A ports
M_A5-M_A8, are coupled to the M₁ port of each of the
back-end director boards 210₁, 210₃, 210₅, and 210₇,
respectively, as indicated in FIG. 8. The other four of the
eight B ports, here B ports M_B5-M_B8, are coupled to the
M₁ port of each of the back-end director boards 210₂, 210₄,
210₆, and 210₈, respectively, as indicated in FIG. 8.

Considering the exemplary four A ports M_A1-M_A4, each
one of the four A ports M_A1-M_A4 can be coupled to the A
port of any one of the memory arrays through the logic
network 221_1A. Thus, considering port M_A1, such port can
be coupled to the A port of the four memory arrays. Likewise,
considering the four A ports M_A5-M_A8, each one of the four
A ports M_A5-M_A8 can be coupled to the A port of any one
of the memory arrays through the logic network 221_2A.
Likewise, considering the four B ports M_B1-M_B4, each one
of the four B ports M_B1-M_B4 can be coupled to the B port
of any one of the memory arrays through logic network
221_1B. Likewise, considering the four B ports M_B5-M_B8,
each one of the four B ports M_B5-M_B8 can be coupled to the
B port of any one of the memory arrays through the logic
network 221_2B. Thus, considering port M_B1, such port can
be coupled to the B port of the four memory arrays. Thus,
there are two paths by which data and control from either a
front-end director 180₁-180₃₂ or a back-end director
200₁-200₃₂ can reach each one of the four memory arrays on
the memory board. Thus, there are eight sets of redundant
ports on a memory board, i.e., ports M_A1, M_B1; M_A2,
M_B2; M_A3, M_B3; M_A4, M_B4; M_A5, M_B5; M_A6, M_B6;
M_A7, M_B7; and M_A8, M_B8. Further, as noted above, each
one of the directors has a pair of redundant ports, i.e., a
402A port and a 402B port (FIG. 7). Thus, for each pair of
redundant directors, the A port (i.e., port 402A) of one of the
directors in the pair is connected to one of the pair of
redundant memory ports and the B port (i.e., 402B) of the
other one of the directors in such pair is connected to the
other one of the pair of redundant memory ports.

More particularly, referring to FIG. 8B, an exemplary pair
of redundant directors is shown, here, for example, front-end
director 180₁ and front-end director 180₂. It is first noted that
the directors 180₁, 180₂ in each redundant pair of directors
must be on different director boards, here boards 190₁, 190₂,
respectively. Thus, here front-end director boards 190₁-190₈
have thereon: front-end directors 180₁, 180₃, 180₅ and 180₇;
front-end directors 180₂, 180₄, 180₆ and 180₈; front-end
directors 180₉, 180₁₁, 180₁₃ and 180₁₅; front-end directors
180₁₀, 180₁₂, 180₁₄ and 180₁₆; front-end directors 180₁₇,
180₁₉, 180₂₁ and 180₂₃; front-end directors 180₁₈, 180₂₀,
180₂₂ and 180₂₄; front-end directors 180₂₅, 180₂₇, 180₂₉ and
180₃₁; and front-end directors 180₂₆, 180₂₈, 180₃₀ and 180₃₂.
Thus, here back-end director boards 210₁-210₈ have
thereon: back-end directors 200₁, 200₃, 200₅ and 200₇;
back-end directors 200₂, 200₄, 200₆ and 200₈; back-end
directors 200₉, 200₁₁, 200₁₃ and 200₁₅; back-end directors
200₁₀, 200₁₂, 200₁₄ and 200₁₆; back-end directors 200₁₇,
200₁₉, 200₂₁ and 200₂₃; back-end directors 200₁₈, 200₂₀,
200₂₂ and 200₂₄; back-end directors 200₂₅, 200₂₇, 200₂₉ and
200₃₁; and back-end directors 200₂₆, 200₂₈, 200₃₀ and 200₃₂.

Thus, here front-end director 180₁, shown in FIG. 8A, is
on front-end director board 190₁ and its redundant front-end
director 180₂, shown in FIG. 8B, is on another front-end
director board, here for example, front-end director board
190₂. As described above, the port 402A of the quad port
RAM 402 (i.e., the A port referred to above) is connected to
switch 406A of crossbar switch 318 and the port 402B of the
quad port RAM 402 (i.e., the B port referred to above) is
connected to switch 406B of crossbar switch 318. The same
holds for redundant director 180₂. However, the ports M₁-M₄
of switch 406A of director 180₁ are connected to the M_A1
ports of global cache memory boards 220₁-220₄, as shown,
while for its redundancy director 180₂, the ports M₁-M₄ of
switch 406A are connected to the redundant M_B1 ports of
global cache memory boards 220₁-220₄, as shown.

Referring in more detail to the crossbar switch 318 (FIG.
7), as noted above, each one of the director boards
190₁-210₈ has such a switch 318 and such switch 318
includes a pair of switches 406A, 406B. Each one of the
switches 406A, 406B is identical in construction, an
exemplary one thereof, here switch 406A, being shown in
detail in FIG. 8C. Thus switch 406A includes four
input/output director-side ports D₁-D₄, as described in
connection with exemplary director board 190₁. Thus, for the
director board 190₁ shown in FIG. 7, the four input/output
director-side ports D₁-D₄ of switch 406A are each coupled to
the port 402A of a corresponding one of the directors 180₁,
180₃, 180₅, and 180₇ on the director board 190₁.

Referring again to FIG. 8C, the exemplary switch 406A
includes a plurality of, here four, switch sections 430₁-430₄.
Each one of the switch sections 430₁-430₄ is identical in
construction and is coupled between a corresponding one of
the input/output director-side ports D₁-D₄ and a
corresponding one of the output/input memory-side ports
M₁-M₄, respectively, as shown. (It should be understood that
the output/input memory-side ports of switch 406B (FIG. 7)
are designated as ports M₅-M₈, as shown. It should also be
understood that while switch 406A is responsive to request
signals on busses R_A1-R_A4 from quad port controller 404 in
directors 180₁, 180₃, 180₅, 180₇ (FIG. 7), switch 406B is
responsive in like manner to request signals on busses
R_B1-R_B4 from controller 404 in directors 180₁, 180₃, 180₅
and 180₇). More particularly, controller 404 of director 180₁
produces request signals on busses R_A1 or R_B1. In like
manner, controller 404 of director 180₃ produces request
signals on busses R_A2 or R_B2, controller 404 of director
180₅ produces request signals on busses R_A3 or R_B3, and
controller 404 of director 180₇ produces request signals on
busses R_A4 or R_B4.

Considering exemplary switch section 430₁, such switch
section 430₁ is shown in FIG. 8C to include a FIFO 432 fed
by the request signal on bus R_A1. (It should be understood
that the FIFOs, not shown, in switch sections 430₂-430₄ are
fed by request signals R_A2-R_A4, respectively). The switch
section 430₁ also includes a request generator 434, an
arbiter 436, and selectors 442 and 446, all arranged as
shown. The data at the memory-side ports M₁-M₄, which are
on busses DM1-DM4, are fed as inputs to selector 446. Also
fed to selector 446 is a control signal produced by the request
generator on bus 449 in response to the request signal R_A1
stored in FIFO 432. The control signal on bus 449 indicates
to the selector 446 the one of the memory-side ports M₁-M₄
which is to be coupled to director-side port D₁. The other
switch sections 430₂-430₄ operate in like manner with
regard to director-side ports D₂-D₄, respectively, and the
memory-side ports M₁-M₄.

It is to be noted that the data portion of the word at port
D₁ (i.e., the word on bus DD1) is also coupled to the other
switch sections 430₂-430₄. It is further noted that the data
portion of the words at ports D₂-D₄ (i.e., the words on
busses DD2-DD4, respectively), are fed to the switch
sections 430₁-430₄, as indicated. That is, each one of the
switch sections 430₁-430₄ has the data portion of the words
on ports D₁-D₄ (i.e., busses DD1-DD4), as indicated. It is
also noted that the data portion of the word at port M₁ (i.e.,
the word on bus DM1) is also coupled to the other switch
sections 430₂-430₄. It is further noted that the data portion
of the words at ports M₂-M₄ (i.e., the words on busses
DM2-DM4, respectively), are fed to the switch sections
430₂-430₄, as indicated. That is, each one of the switch
sections 430₁-430₄ has the data portion of the words on
ports M₁-M₄ (i.e., busses DM1-DM4), as indicated.

As will be described in more detail below, a request on
bus R_A1 to switch section 430₁ is a request from the director
180₁ which identifies the one of the four ports M₁-M₄ in
switch section 430₁ that is to be coupled to port 402A of
director 180₁ (director-side port D₁). Thus, port 402A of
director 180₁ may be coupled to one of the memory-side
ports M₁-M₄ selectively in accordance with the data on bus
R_A1. Likewise, requests on buses R_A2, R_A3, R_A4 to
switch sections 430₂-430₄, respectively, are requests from
the directors 180₃, 180₅, and 180₇, respectively, which
identify the one of the four ports M₁-M₄ in switch sections
430₂-430₄ that is to be coupled to port 402A of directors
180₃, 180₅ and 180₇, respectively.

More particularly, the requests R_A1 are stored as they are
produced by the quad port RAM controller 404 (FIG. 7) in
receive FIFO 432. The request generator 434 receives from
FIFO 432 the requests and determines which one of the four
memory-side ports M₁-M₄ is to be coupled to port 402A of
director 180₁. These requests for memory-side ports M₁-M₄
are produced on lines RA1,1-RA1,4, respectively. Thus,
line RA1,1 (i.e., the request for memory-side port M₁) is fed
to arbiter 436 and the requests from switch sections
430₂-430₄ (which are coupled to port 402A of directors
180₃, 180₅, and 180₇) on lines RA2,1, RA3,1 and RA4,1,
respectively, are also fed to the arbiter 436, as indicated. The
arbiter 436 resolves multiple requests for memory-side port
M₁ on a first-come, first-served basis. The arbiter 436 then
produces a control signal on bus 435 indicating the one of
the directors 180₁, 180₃, 180₅ or 180₇ which is to be coupled
to memory-side port M₁.
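
The first-come, first-served resolution performed by the arbiter 436 for one memory-side port can be modeled with a queue. This is a behavioural sketch only. The text does not describe how a granted port is given up, so the `release` step, like the class and method names, is an assumption introduced to make the model complete.

```python
from collections import deque


class PortArbiter:
    """First-come, first-served arbiter for a single memory-side port."""

    def __init__(self):
        self.waiting = deque()  # directors queued in arrival order
        self.owner = None       # director currently coupled to the port

    def request(self, director):
        """Queue a request and return the director currently granted the port."""
        self.waiting.append(director)
        if self.owner is None:
            self.owner = self.waiting.popleft()
        return self.owner

    def release(self):
        """Free the port (assumed step) and grant it to the oldest waiting request."""
        self.owner = self.waiting.popleft() if self.waiting else None
        return self.owner


arbiter = PortArbiter()
first_grant = arbiter.request(3)   # director 180-3 asks first and is granted
second_grant = arbiter.request(1)  # director 180-1 must wait its turn
next_owner = arbiter.release()
```

The grant returned by `request` plays the role of the control signal on bus 435: it names the director whose data bus the selector 442 should pass through.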

The control signal on bus 435 is fed to selector 442. Also
fed to selector 442 are the data portion of the data at port D₁
(i.e., the data on data bus DD1) along with the data portion of
the data at ports D₂-D₄, i.e., the data on data busses
DD2-DD4, respectively, as indicated. Thus, the control
signal on bus 435 causes the selector 442 to couple to the
output thereof the data busses DD1-DD4 from the one of the
directors 180₁, 180₃, 180₅, 180₇ being granted access to
memory-side port M₁ by the arbiter 436. The selected
output of selector 442 is coupled to memory-side port M₁.
It should be noted that when the arbiter 436 receives a
request via the signals on lines RA1,1, RA2,1, RA3,1 and
RA4,1, acknowledgements are returned by the arbiter 436
via acknowledgement signals on lines AK1,1, AK1,2, AK1,3,
AK1,4, respectively, such signals being fed to the request
generators 434 in switch sections 430₁, 430₂, 430₃, 430₄,
respectively.

Thus, the data on any port D₁-D₄ can be coupled to any
one of the ports M₁-M₄ to effectuate the point-to-point data
paths P₁-P₆₄ described above in connection with FIG. 2.

Referring again to FIG. 7, data from host computer 120
(FIG. 2) is presented to the system interface 160 (FIG. 2) in
batches from many host computer processors 121₁-121₃₂.
Thus, the data from the host computer processors
121₁-121₃₂ are interleaved with each other as they are
presented to a director 180₁-180₃₂. The batch from each
host computer processor 121₁-121₃₂ (i.e., source) is tagged
by the protocol translator 400, more particularly by a
Tachyon ASIC in the case of a fibre channel connection. The
controller 404 has a look-up table formed during
initialization. As the data comes into the protocol translator
400 and is put into the quad port RAM 402 under the control
of controller 404, the protocol translator 400 informs the
controller that the data is in the quad port RAM 402. The
controller 404 looks at the configuration of its look-up table
to determine the global cache memory 220 location (e.g.,
cache memory board 220₁-220₈) the data is to be stored
into. The controller 404 thus produces the request signals on
the appropriate bus R_A1, R_B1, and then tells the quad port
RAM 402 that there is a block of data at a particular location
in the quad port RAM 402 and to move it to the particular
location in the global cache memory 220. The crossbar
switch 318 also takes a look at what other controllers 404 in
the directors 180₃, 180₅, and 180₇ on that particular director
board 190₁ are asking by making request signals on busses
R_A2, R_B2, R_A3, R_B3, R_A4, R_B4, respectively. The
arbitration of multiple requests is handled by the arbiter 436
as described above in connection with FIG. 8C.

Referring again to FIG. 7, the exemplary director 180₁ is
shown to include the message engine/CPU controller 314.
The message engine/CPU controller 314 is contained in a
field programmable gate array (FPGA). The message engine
(ME) 315 is coupled to the CPU bus 317 and the DMA
section 408 as shown. The message engine (ME) 315
includes a Direct Memory Access (DMA) section 408, a
message engine (ME) state machine 410, a transmit buffer
422 and receive buffer 424, a MAC packetizer/de-packetizer
428, send and receive pointer registers 420, and a parity
generator 321. The DMA section 408 includes a DMA
transmitter 418, shown and to be described below in detail
in connection with FIG. 9, and a DMA receiver 420, shown
and to be described below in detail in connection with FIG.
10, each of which is coupled to the CPU bus interface 317,
as shown in FIG. 7. The message engine (ME) 315 includes
a transmit data buffer 422 coupled to the DMA transmitter
418, a receive data buffer 424 coupled to the DMA receiver
420, registers 420 coupled to the CPU bus 317 through an
address decoder 401, the packetizer/de-packetizer 428,
described above, coupled to the transmit data buffer 422, the
receive data buffer 424 and the crossbar switch 320, as
shown, and a parity generator 321 coupled between the
transmit data buffer 422 and the crossbar switch 320. More
particularly, the packetizer portion 428P is used to packetize
the message payload into a MAC packet (FIG. 2B) passing
from the transmit data buffer 422 to the crossbar switch 320
and the de-packetizer portion 428D is used to de-packetize
the MAC packet into message payload data passing from the
crossbar switch 320 to the receive data buffer 424. The
packetization is here performed by a MAC core which
builds a MAC packet and appends to each message such
things as a source and destination address designation
indicating the director sending and receiving the message and a
cyclic redundancy check (CRC), as described above. The
message engine (ME) 315 also includes: a receive write
pointer 450, a receive read pointer 452, a send write pointer
454, and a send read pointer 456.

Referring now to FIGS. 11 and 12, the transmission of a
message from a director 180₁-180₃₂, 200₁-200₃₂ and the
reception of a message by a director 210₁-210₃₂ (here
exemplary director 180₁ shown in FIG. 7) will be described.
Considering first transmission of a message, reference is
made to FIGS. 7 and 11. First, as noted above, at power-up
the controller 306 (FIG. 5) of both message network boards
304₁, 304₂ initialize the message routing mapping described
above for the switches 308₁-308₄ in switch section 308 and
for the crossbar switches 320. As noted above, a request is
made by the host computer 120. The request is sent to the
protocol translator 400. The protocol translator 400 sends
the request to the microprocessor 299 via CPU bus 317 and
buffer 301. When the CPU 310 (FIG. 7) in the
microprocessor 299 of exemplary director 180₁ determines that a
message is to be sent to another one of the directors
180₂-180₃₂, 200₁-200₃₂ (e.g., the CPU 310 determines that
there has been a miss in the global cache memory 220
(FIG. 2) and wants to send a message to the appropriate one
of the back-end directors 200₁-200₃₂, as described above in
connection with FIG. 2), the CPU 310 builds a 64 byte
descriptor (FIG. 2A) which includes a 32 byte message
payload indicating the addresses of the batch of data to be
read from the bank of disk drives 140 (FIG. 2) (Step 500)
and a 32 byte command field (Step 510) which indicates the
message destination via an 8-byte bit vector, i.e., the
director, or directors, which are to receive the message. An
8-byte portion of the command field indicates the director or
directors which are to receive the message. That is, each one
of the 64 bits in the 8-byte portion corresponds to one of the
64 directors. Here, a logic 1 in a bit indicates that the
corresponding director is to receive a message and a logic 0
indicates that such corresponding director is not to receive
the message. Thus, if the 8-byte word has more than one
logic 1 bit, more than one director will receive the same
message. As will be described, the same message will not be
sent in parallel to all such directors but rather the same
message will be sent sequentially to all such directors. In any
event, the resulting descriptor generated by the
CPU 310 (FIG. 7) (Step 512) is written into the RAM 312
(Step 514), as shown in FIG. 11.

More particularly, the RAM 312 includes a pair of queues:
a send queue and a receive queue, as shown in FIG. 7. The
RAM 312 is coupled to the CPU bus 317 through an Error
Detection and Correction (EDAC)/Memory control section
303, as shown. The CPU 310 then indicates to the message
engine (ME) 315 state machine 410 (FIG. 7) that a
descriptor has been written into the RAM 312. It should be noted
that the message engine (ME) 315 also includes: a receive
write pointer or counter 450, the receive read pointer or
counter 452, the send write pointer or counter 454, and the
send read pointer or counter 456, shown in FIG. 7. All four
pointers 450, 452, 454 and 456 are reset to zero on
power-up. As is also noted above, the message engine/CPU
controller 314 also includes: the de-packetizer portion 428D of
packetizer/de-packetizer 428, coupled to the receive data
buffer 424 (FIG. 7) and a packetizer portion 428P of the
packetizer/de-packetizer 428, coupled to the transmit data
buffer 422 (FIG. 7). Thus, referring again to FIG. 11, when
the CPU 310 indicates that a descriptor has been written into
the RAM 312 and is now ready to be sent, the CPU 310
increments the send write pointer and sends it to the send
write pointer register 454 via the register decoder 401. Thus,
the contents of the send write pointer register 454 indicates
the number of messages in the send queue 312S of RAM
312 which have not been sent. The state machine 410
checks the send write pointer register 454 and the send read
pointer register 456, Step 518. As noted above, both the send
write pointer register 454 and the send read pointer register
456 are initially reset to zero during power-up. Thus, if the
send read pointer register 456 and the send write pointer
register 454 are different, the state machine knows that there
is a message in RAM 312 and that such message is ready
for transmission. If a message is to be sent, the state machine
410 initiates a transfer of the stored 64-byte descriptor to the
message engine (ME) 315 via the DMA transmitter 418,
FIG. 7 (Steps 520, 522). The descriptor is sent from the send
queue 312S in RAM 312 until the send read pointer 456 is
equal to the send write pointer 454.

As described above in connection with Step 510, the CPU
310 generates a destination vector indicating the director, or
directors, which are to receive the message. As also
indicated above, the command field is 32-bytes, eight bytes
thereof having a bit representing a corresponding one of the
64 directors to receive the message. For example, referring
to FIG. 11C, each of the bit positions 1-64 represents
directors 180₁-180₃₂, 200₁-200₃₂, respectively. Here, in this
example, because a logic 1 is only in bit position 1, the
eight-byte vector indicates that the destination director is
only front-end director 180₁. In the example in FIG. 11D,
because a logic 1 is only in bit position 2, the eight-byte
vector indicates that the destination director is only
front-end director 180₂. In the example in FIG. 11E, because a
logic 1 is in more than one bit position, the destination for the
message is to more than one director, i.e., a multi-cast
message. In the example in FIG. 11E, a logic 1 is only in bit
positions 2, 3, 63 and 64. Thus, the eight-byte vector
indicates that the destination directors are only front-end
directors 180₂ and 180₃ and back-end directors 200₃₁ and
200₃₂. There is a mask vector stored in a register of register
section 420 (FIG. 7) in the message engine (ME) 315 which
identifies a director or directors which may not be available
for use (e.g., a defective director or a director not in the
system at that time) (Steps 524, 525, for a uni-cast
transmission). If the message engine (ME) 315 state machine 410
indicates that the director is available by examining the transmit
vector mask (FIG. 11F) stored in register 420, the message
engine (ME) 315 encapsulates the message payload with a
MAC header and CRC inside the packetizer portion 428P,
discussed above (Step 526). An example of the mask is
shown in FIG. 11F. The mask has 64 bit positions, one for
each one of the directors. Thus, as with the destination
vectors described above in connection with FIGS. 11C-11E,
bit positions 1-64 represent directors 180₁-180₃₂,
200₁-200₃₂, respectively. Here in this example, a logic 1 in
a bit position in the mask indicates that the representative
director is available and a logic 0 in such bit position
indicates that the representative director is not available.
Here, in the example shown in FIG. 11F, only director 200₃₂
is unavailable. Thus, if the message has a destination vector
as indicated in FIG. 11E, the destination vector, after passing
through the mask of FIG. 11F, is modified
to that shown in FIG. 11G. Thus, director 200₃₂ will not
receive the message. Such mask modification to the
destination vector is important because, as will be described, the
messages on a multi-cast are sent sequentially and not in
parallel. Thus, elimination of message transmission to an
unavailable director or directors increases the message
transmission efficiency of the system.
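The destination vector and transmit mask described above can be sketched in a few lines of Python. This is only an illustration, not the hardware logic of the message engine: the function names are hypothetical, and mapping bit position 1 to the least significant bit of an integer is an assumption made for the example.

```python
NUM_DIRECTORS = 64  # bit positions 1-64, one per director


def make_destination_vector(positions):
    """Return an integer with a logic 1 at each listed bit position (1-based)."""
    vector = 0
    for pos in positions:
        if not 1 <= pos <= NUM_DIRECTORS:
            raise ValueError(f"bit position {pos} is out of range")
        vector |= 1 << (pos - 1)
    return vector


def apply_mask(destination, mask):
    """Keep only destinations whose mask bit is 1 (director available)."""
    return destination & mask


def targets(vector):
    """List the bit positions that are set, in ascending order."""
    return [pos for pos in range(1, NUM_DIRECTORS + 1)
            if (vector >> (pos - 1)) & 1]


# Destination as in FIG. 11E: bit positions 2, 3, 63 and 64.
destination = make_destination_vector([2, 3, 63, 64])
# Mask as in FIG. 11F: every director available except position 64.
mask = ((1 << NUM_DIRECTORS) - 1) & ~(1 << 63)
# The message would then be sent, one director after another, to these positions.
remaining = targets(apply_mask(destination, mask))
```

With these inputs, `remaining` holds positions 2, 3 and 63, which corresponds to the modified vector of FIG. 11G in which director 200₃₂ is dropped.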
Having packetized the message into a MAC packet via the
packetizer portion of the packetizer/de-packetizer 428 (FIG.
7), the message engine (ME) 315 transfers the MAC packet
to the crossbar switch 320 (Step 528) and the MAC packet
is routed to the destination by the message network 260
(Step 530) via message network boards 304₁, 304₂ or on the
same director board via the crossbar switch 320 on such
director board.

Referring to FIG. 12, the message read operation is
described. Thus, in Step 600, the director waits for a
message. When a message is received, the message engine (ME)
315 state machine 410 receives the packet (Step 602). The
state machine 410 checks the receive bit vector mask (FIG.
11, stored in register 426) against the source address of the
packet (Step 604). If the state machine 410 determines that
the message is from an improper source (i.e., a faulty
director as indicated in the mask, FIG. 11F, for example), the
packet is discarded (Step 606). On the other hand, if the state
machine 410 determines that the packet is from a proper or
valid director (i.e., source), the message engine (ME) 315
de-encapsulates the message from the packet (Step 608) in
de-packetizer 428D. The state machine 410 in the message
engine (ME) 315 initiates a 32-byte payload transfer via the
DMA receive operation (Step 610). The DMA writes the 32
byte message to the memory receive queue 312R in the
RAM 312 (Step 612). The message engine (ME) 315 state
machine 410 then increments the receive write pointer
register 450 (Step 614). The CPU 310 then checks whether
the receive write pointer 450 is equal to the receive read
pointer 452 (Step 616). If they are equal, such condition
indicates to the CPU 310 that a message has not been
received (Step 618). On the other hand, if the receive write
pointer 450 and the receive read pointer 452 are not equal,
such condition indicates to the CPU 310 that a message has
been received and the CPU 310 processes the message in the
receive queue 312R of RAM 312 and then the CPU 310
increments the receive read pointer and writes it into the
receive read pointer register 452. Thus, messages are stored
in the receive queue 312R of RAM 312 until the contents of
the receive read pointer 452 and the contents of the receive
write pointer 450, which are initialized to zero during
power-up, are equal.

Referring now to FIG. 13, the acknowledgement of a
message operation is described. In Step 700 the receive
DMA engine 420 successfully completes a message transfer
to the receive queue in RAM 312 (FIG. 7). The state
machine 410 in the message engine (ME) 315 generates an
acknowledgement MAC packet and transmits the MAC
packet to the sending director via the message network 260
(FIG. 2) (Steps 702, 704). The message engine (ME) 315 at
the sending director de-encapsulates a 16 byte status payload
in the acknowledgement MAC packet and transfers such
status payload via a receive DMA operation (Step 706). The
DMA of the sending (i.e., source) director writes to a status
field of the descriptor within the RAM memory send queue
312S (Step 708). The state machine 410 of the message
engine (ME) 315 of the sending director (which received the
acknowledgement message) increments its send read pointer
456 (Step 712). The CPU 310 of the sending director (which
received the acknowledgement message) processes the
descriptor status and removes the descriptor from the send
queue 312S of RAM 312 (Step 714). It should be noted that
the send and receive queues 312S and 312R are each circular
queues.

As noted above, the MAC packets are, during normal
operation, transmitted alternatively to one of the pair of
message network boards 304₁, 304₂ by a hardware selector
S in the crossbar switch 320. The selector S is responsive to
the bit B in the header of the MAC packet (FIG. 2B) and,
when such bit B is one logic state, the data is coupled to one
of the message network boards 402A and in response to the
opposite logic state the data is coupled to the other one of the
message network boards 402B. That is, when one message
is transmitted to board 304₁, the next message is transmitted
to board 304₂.

Referring again to FIG. 9, the details of an exemplary
transmit DMA 418 are shown. As noted above, a
descriptor is created by the CPU 310 (FIG. 7) and is
then stored in the RAM 312. If the send write pointer 454
(FIG. 7) and send read pointer 456, described above, have
different counts, an indication is provided by the state
machine 410 in the message engine (ME) 315 (FIG. 7) that
the created descriptor is available for DMA transmission to
the message engine (ME) 315, the payload of the descriptor
is packetized into a MAC packet and sent through the
message network 260 (FIG. 2) to one or more directors
180₁-180₃₂, 200₁-200₃₂. More particularly, the descriptor
created by the CPU 310 is first stored in the local cache
memory 319 and is later transferred to the send queue 312S
in RAM 312. When the send write pointer 454 and send read
pointer 456 have different counts, the message engine (ME)
315 state machine 410 initiates a DMA transmission as
discussed above in connection with Step 520 (FIG. 11).
Further, as noted above, the descriptor resides in send
queue 312S within the RAM 312. Further, as noted above,
each descriptor which contains the message is a fixed size,
here 64-bytes. As each new, non-transmitted descriptor is
created by the CPU 310, it is sequentially stored in a
sequential location, or address, in the send queue 312S. Here,
the address is a 32-bit address.

When the transmit DMA is initiated, the state machine
410 in the message engine (ME) 315 (FIG. 7) sends the
queue address on bus 411 to an address register 413 in the
DMA transmitter 418 (FIG. 9) along with a transmit write
enable signal Tx_WE. The DMA transmitter 418
requests the CPU bus 317 by asserting a signal on Xmit_BR.
The CPU bus arbiter 414 (FIG. 7) performs a bus arbitration
and when complete the arbiter 414 grants the DMA
transmitter 418 access to the CPU bus 317. The Xmit CPU
state machine 419 then places the address currently available
in the address register 413 on the address bus portion 317A
of CPU bus 317 by loading the output address register 403.
Odd parity is generated by a parity generator 405 before
loading the output address register 403. The address in
register 403 is placed on the CPU bus 317 (FIG. 7) for RAM
312 send queue 312S, along with appropriate read control
signals via CPU bus 317 portion 317C. The data at the
address from the RAM 312 passes, via the data bus portion
317D of CPU bus 317, through a parity checker 415 to a data
input register 417. The control signals from the CPU 310 are
fed to a Xmit CPU state machine 419 via CPU bus 317 bus
portion 317C. One of the control signals indicates whether
the most recent copy of the requested descriptor is in the
send queue 312S of the RAM 312 or still resident in the local
cache memory 319. That is, the most recent descriptor at any
given address is first formed by the CPU 310 in the local
cache memory 319 and is later transferred by the CPU 310
to the queue in the RAM 312. Thus, there may be two
descriptors with the same address; one in the RAM 312 and
one in the local cache memory 319 (FIG. 7), the most recent
one being in the local cache memory 319. In either event, the
transmit DMA 418 must obtain the descriptor for DMA
transmission from the RAM 312 and this descriptor is stored
in the transmit buffer register 421 using signal 402 produced
by the state machine 419 to load these registers 421. The
control signal from the CPU 310 to the Xmit CPU state
machine 419 indicates whether the most recent descriptor is
in the local cache memory 319. If the most recent descriptor
is in the local cache memory 319, the Xmit CPU state
machine 419 inhibits the data that was just read from send
queue 312S in the RAM 312 and which has been stored in
register 421 from passing to selector 423. In such case, state
machine 419 must perform another data transfer at the same
address location. The most recent message is then
transferred by the CPU 310 from the local cache memory 319 to
the send queue 312S in the RAM 312. The Xmit CPU
state machine 419 then re-arbitrates for the CPU bus 317 and
after it is granted such CPU bus 317, the Xmit CPU state
machine 419 then reads the descriptor from the RAM 312.
This time, however, the most recent descriptor is
available in the send queue 312S in the RAM 312. The
descriptor in the RAM 312 is now loaded into the transmit
buffer register 421 in response to the assertion of the signal
402 by the Xmit CPU state machine 419. The descriptor in
the register 421 is then transferred through selector 423 to
message bus interface 409 under the control of a Xmit
message (msg) state machine 427. That is, the descriptor in
the transmit buffer register 421 is transferred to the transmit
data buffer 422 (FIG. 7) over the 32 bit transmit message bus
interface 409 by the Xmit message (msg) state machine 427.
The data in the transmit data buffer 422 (FIG. 7) is
packetized by the packetizer section of the
packetizer/de-packetizer 428 as described in Step 530 in FIG. 11.

More particularly, and referring also to FIG. 14A, the
method of operating the transmit DMA 418 (FIG. 9) is
shown. As noted above, each descriptor is 64 bytes. Here, the
transfer of the descriptor takes place over two interfaces,
namely, the CPU bus 317 and the transmit message interface
bus 409 (FIG. 7). The CPU bus 317 is 64 bits wide and eight
64-bit double-words constitute a 64-byte descriptor. The
Xmit CPU state machine 419 generates the control signals
which result in the transfer of the descriptor from the RAM
312 into the transmit buffer register 421 (FIG. 7). The
64-byte descriptor is transferred in two 32-byte burst
accesses on the CPU bus 317. Each one of the eight double
words is stored sequentially in the transmit buffer register
421 (FIG. 9). Thus, in Step 800, the message engine 315
state machine 410 loads the transmit DMA address register
413 with the address of the descriptor to be transmitted in the
send queue 312S in RAM 312. This is done by asserting
the Tx_WE signal, which puts the Xmit CPU state machine
419 in Step 800, loads the address register 413 and proceeds
to Step 802. In Step 802, the Xmit CPU state machine 419
loads the CPU transfer counter 431 (FIG. 9) with a 32-byte
count, which is 2. This is the number of 32 byte transfers that
would be required to transfer the 64-byte descriptor.
The Xmit CPU state machine 419 now proceeds to Step
804. In Step 804, the transmit DMA state machine 419
checks the validity of the address that is loaded into its
address register 413. The address loaded into the address
register 413 is checked against the values loaded into the
memory address registers 435. The memory address
registers 435 contain the base address and the offset of the send
queue 312S in the RAM 312. The sum of the base address
and the offset is the range of addresses for the send queue
312S in RAM 312. The address check circuitry 437
constantly checks whether the address in the address register
413 is within the range of the send queue 312S in the RAM
312. If the address is found to be outside the range of the
send queue 312S, the transfer is aborted; this status is stored
in the status register 404 and then passed back to the
message engine 315 state machine 410 in Step 816. The
check for valid addresses is done in Step 805. If the address
is within the range, i.e., valid, the transmit DMA state
machine 419 proceeds with the transfer and proceeds to Step
806. In Step 806, the transmit DMA state machine 419
requests the CPU bus 317 by asserting the Xmit_BR signal
to the arbiter 414 and then proceeds to Step 807. In Step 807,
the Xmit CPU state machine 419 constantly checks if it has
been granted the bus by the arbiter. When the CPU bus 317
is granted, the Xmit CPU state machine proceeds to Step
808. In Step 808, the Xmit CPU state machine 419 generates
an address and a data cycle which essentially reads 32 bytes
of the descriptor from the send queue 312S in the RAM 312
into its transmit buffer register 421. The Xmit CPU state
machine 419 now proceeds to Step 810. In Step 810, the
Xmit CPU state machine 419 loads the descriptor that was
read into its buffer registers 421 and proceeds to Step 811.
In Step 811, a check is made for any local cache memory 319
coherency errors (i.e., checks whether the most recent data
is in the cache memory 319 and not in the RAM 312) on
these 32 bytes of data. If this data is detected to be resident
in the local CPU cache memory 319, then the Xmit CPU state
machine 419 discards this data and proceeds to Step 806.
The Xmit CPU state machine 419 now requests the CPU
bus 317 again and, when granted, transfers another 32 bytes
of data into the transmit buffer register 421, by which time
the CPU has already transferred the latest copy of the
descriptor into the RAM 312. In cases when the 32 bytes of
the descriptor initially fetched from the RAM 312 were not
resident in the local CPU cache memory 319 (i.e., if no
cache coherency errors were detected), the Xmit CPU state
machine 419 proceeds to Step 812. In Step 812, the Xmit
CPU state machine 419 decrements counter 431 and
increments the address register 413 so that such address register
413 points to the next address. The Xmit CPU state machine
then proceeds to Step 814. In Step 814, the Transmit
CPU state machine 419 checks to see if the transfer counter
431 has expired, i.e., counted to zero. If the count is found
to be non-zero, it then proceeds to Step 804 to start the
transfer of the next 32 bytes of the descriptor. In case the
counter 431 is zero, the process goes to Step 816 to complete
the transfer. The successful transfer of the second 32 bytes
of descriptor from the RAM 312 into the transmit DMA
buffer register 421 completes the transfer over the CPU bus
317.
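The CPU-bus side of the transmit DMA (Steps 800 through 816) can be summarized as a small loop. The following Python sketch is a software analogy only, written under stated assumptions: the function name and the `is_stale` callback are hypothetical stand-ins for the cache-coherency indication from the CPU, bus request and grant (Steps 806, 807) are omitted, and the valid range is read as the half-open interval from the base address to base plus offset.

```python
BURST_BYTES = 32       # one burst access on the CPU bus
DESCRIPTOR_BYTES = 64  # fixed descriptor size


def transmit_dma_read(ram, address, queue_base, queue_size,
                      is_stale=lambda addr, attempt: False):
    """Fetch one 64-byte descriptor from `ram` in two 32-byte bursts.

    Returns (status, data). `is_stale(addr, attempt)` is a hypothetical
    callback; it must eventually return False or the retry never ends.
    """
    buffer = bytearray()
    counter = DESCRIPTOR_BYTES // BURST_BYTES  # Step 802: count of 2
    while counter > 0:                         # Step 814: stop at zero
        # Steps 804/805: abort if the burst falls outside the send queue.
        in_range = (queue_base <= address and
                    address + BURST_BYTES <= queue_base + queue_size)
        if not in_range:
            return "address_error", bytes(buffer)
        attempt = 0
        while True:
            # Steps 808/810: read one 32-byte burst into the buffer register.
            burst = ram[address:address + BURST_BYTES]
            # Step 811: if the newest copy was still in the CPU cache,
            # discard the burst and read the same address again.
            if not is_stale(address, attempt):
                break
            attempt += 1
        buffer += burst
        counter -= 1                           # Step 812
        address += BURST_BYTES
    return "ok", bytes(buffer)                 # Step 816
```

For example, with `ram = bytes(range(256))`, a send queue based at address 64 and 128 bytes long, `transmit_dma_read(ram, 64, 64, 128)` returns the status `"ok"` together with bytes 64 through 127.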

The message interface 409 is 32 bits wide and sixteen 32
bit words constitute a 64-byte descriptor. The 64-byte
descriptor is transferred in batches of 32 bytes each. The
Xmit msg state machine 427 controls and manages the
interface 409. The Xmit CPU state machine asserts the signal
433 to indicate that the first 32 bytes have been successfully
transferred over the CPU bus 317 (Step 818, FIG. 14B); this
puts the Xmit msg state machine into Step 818 and starts the
transfer on the message interface. In Step 820, the Xmit msg
machine 427 resets burst/transfer counters 439 and initiates
the transfer over the message interface 409. In Step 820, the
transfer is initiated over the message interface 409 by
asserting the "transfer valid" (TX_DATA_Valid) signal
indicating to the message engine 315 state machine 410 that
valid data is available on the bus 409. The transmit msg
machine 427 transfers 32 bits of data on every subsequent
clock until its burst counter in burst/transfer counter 439
reaches a value equal to eight, Step 822. The burst counter
in burst/transfer counter 439 is incremented with each 32-bit
word put on the message bus 409 by a signal on line 433.
When the burst count is eight, a check is made by the state
machine 427 as to whether the transmit counter 431 has
expired, i.e., is zero, Step 824. The expiry of the transfer
counter in burst/transfer counter 439 indicates the 64 byte
descriptor has been transferred to the transmit buffer 422 in
message engine 315. If it has expired, the transmit message
state machine 427 proceeds to Step 826. In Step 826, the
Xmit msg state machine asserts the output End of Transfer
(Tx_EOT) indicating the end of transfer over the message
bus 409. In this state, after the assertion of the Tx_EOT
signal, the status of the transfer captured in the status register
404 is sent to the message engine 315 state machine 410.
The DMA operation is complete with the descriptor being
stored in the transmit buffer 422 (FIG. 7).

On the other hand, if the transfer counter in burst/transfer
counter 439 has not expired, the process goes to Step 800
and repeats the above described procedure to transfer the second
32 bytes of descriptor data, at which time the transfer will be
complete.
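The arithmetic of the message-interface transfer (sixteen 32-bit words, moved as two bursts of eight) can be checked with a short Python sketch. The function name is hypothetical, and the big-endian word order is an assumption; the description does not state the byte order on bus 409.

```python
import struct

WORDS_PER_BURST = 8  # the burst counter runs to eight 32-bit words


def message_interface_words(descriptor):
    """Split a 64-byte descriptor into two bursts of eight 32-bit words."""
    if len(descriptor) != 64:
        raise ValueError("descriptor must be exactly 64 bytes")
    # Big-endian unpacking is an assumption made for illustration only.
    words = struct.unpack(">16I", descriptor)
    return [words[i:i + WORDS_PER_BURST]
            for i in range(0, len(words), WORDS_PER_BURST)]


bursts = message_interface_words(bytes(range(64)))
```

Each burst carries 8 x 4 = 32 bytes, so two bursts account for the full 64-byte descriptor, matching the two 32-byte batches read over the CPU bus.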

Referring now to FIG. 10, the receive DMA 420 is
shown. Here, a message received from another director is to
be written into the RAM 312 (FIG. 7). The receive DMA
420 is adapted to handle three types of information: error
information which is 8 bytes in size; acknowledgement
information which is 16 bytes in size; and receive message
payload and/or fabric management information which is 32
bytes in size. Referring also to FIG. 7, the message engine
315 state machine 410 asserts the Rx_WE signal, indicating
to the receive DMA 420 that it is ready to transfer the data in
its Rec buffer 416, FIG. 7. The data in the receive buffer could
be the 8-byte error information, the 16-byte
acknowledgement information or the 32-byte fabric management/
receive message payload information. It places a 2 bit
encoded receive transfer count on the Rx_transfer count
signal, indicating the type of information, and an address
which is the address where this information is to be stored
in the receive queue of RAM 312. In response to the receive
write enable signal Rx_WE, the Receive message machine
450 (FIG. 10) loads the address into the address register 452
and the transfer count, indicating the type of information,
into the receive transfer counter 454. The address loaded
into the address register 452 is checked by the address check
circuitry 456 to see if it is within the range of the receive
queue addresses in the RAM 312. This is done by checking
the address against the values loaded into the memory
registers 457 (i.e., a base address register and an offset
register therein). The base address register contains the start
address of the receive queue 312R residing in the RAM 312
and the offset register contains the size of this receive queue
312R in RAM 312. Therefore the additive sum of the values
stored in the base address register and the offset register
specifies the range of addresses of the receive queue 312R in
the RAM 312. The memory registers 457 are loaded during
initialization. On the subsequent clock after the assertion of
the Rx_WE signal, the message engine 315 state machine
410 then proceeds to place the data on a 32-bit message
engine 315 data bus 407, FIG. 10. A Rx_data_valid signal
accompanies each 32 bits of data, indicating that the data on
the message engine data bus 407 is valid. In response to this
Rx_data_valid signal the receive message state machine
450 loads the data on the data bus into the receive buffer
register 460. The end of the transfer over the message engine
data bus 407D is indicated by the assertion of the Rx_EOT
signal, at which time the Receive message state machine 450
loads the last 32 bits of data on the message engine data bus
407D of bus 407 into the receive buffer registers 460. This
signals the end of the transfer over the message engine data
bus 407D portion of bus 407. The end of such transfer is
conveyed to the Rx_Cpu state machine 462 by the assertion
of the signal 464. The Receive CPU machine 462 now
requests the CPU bus 317 by asserting the signal
REC_Br. After an arbitration by CPU bus arbiter 414 (FIG.
7) the receive DMA 420 (FIG. 10) is given access to the
CPU bus 317. The Receive CPU state machine 462 proceeds
to transfer the data in its buffer registers 424 over the CPU
bus 317 into the Receive queue 312R in the RAM 312.
Simultaneously, this data is also transferred into a duplicate
buffer register 466. The data at the output of the receive
buffer register 460 passes to one input of a selector 470 and
also passes to a duplicate data receive buffer register 466.
The output of the duplicate receive buffer register 466 is fed
to a second input of the selector 470. As the data is being
transferred by the Receive CPU state machine 462, it is also
checked for cache coherency errors. If the data
corresponding to the address being written into the RAM 312 is located
in the CPU's local cache memory 319 (FIG. 7), the receive
DMA machine 420 waits for the CPU 310 to copy the old
data in its local cache memory 319 back to the receive queue
312R in the RAM 312 and then overwrites this old data with
a copy of the new data from the duplicate buffer register 466.

More particularly, if central processing unit 310 indicates
to the DMA receiver 420 that the data in the receive buffer
register 460 is available in the local cache memory 319, the
receive CPU state machine 462 produces a select signal on
line 463 which couples the data in the duplicate buffer
register 466 to the output of selector 470 and then to the bus
317 for storage in the random access memory 312.
The successful write into the RAM 312 completes the DMA
transfer. The receive DMA 420 then signals the message
engine 315 state machine 410 on the status of the transfer.
The status of the transfer is captured in the status register
459.

Thus, with both the receive DMA and the transmit DMA,
there is a checking of the local cache memory 319 to
determine whether it has "old" data, in the case of the
receive DMA, or whether it has "new" data, in the case of the
transmit DMA.

Referring now to FIG. 15A, the operation of the receive
DMA 420 is shown. Thus, in Step 830 the Receive message
machine 450 checks if the write enable signal Rx_WE is
asserted. If found asserted, the receive DMA 420 proceeds
to load the address register 452 and the transfer counter 454.
The value loaded into the transfer counter 454 determines
the type of DMA transfer requested by the Message engine
state machine 410 in FIG. 7. The Receive message state
machine 450 then checks whether the Rx_DATA_VALID signal
is asserted. If asserted, it proceeds to Step 836. The Receive
message state machine loads the buffer register 460 (FIG. 10)
in Step 836 with the data on the message engine data bus
407D of bus 407 (FIG. 10). The Rx_DATA_VALID signal
accompanies each piece of data put on the bus 407. The data
is sequentially loaded into the buffer registers 460 (FIG. 10).
The end of the transfer on the message engine data bus 407D
of bus 407 is indicated by the assertion of the Rx_EOT signal.
When the Receive message state machine 450 is in the end of
transfer state (Step 840), it signals the Receive CPU state
machine 462 and this starts the transfer on the CPU bus 317
side.
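
The message-engine side of the receive DMA described above can be sketched in software as follows. This is only an illustrative model, not the hardware of FIG. 15A: the class name `ReceiveMessageMachine`, its method names, and the use of a Python list for the buffer registers 460 are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ReceiveMessageMachine:
    """Toy model of the Receive message state machine (hypothetical names)."""
    address: int = 0                # stands in for address register 452
    transfer_count: int = 0         # stands in for transfer counter 454
    buffer: list = field(default_factory=list)  # buffer registers 460
    done: bool = False              # set when the end of transfer is seen

    def start(self, rx_we: bool, address: int, count: int) -> bool:
        # Step 830: nothing is loaded until Rx_WE is asserted.
        if not rx_we:
            return False
        self.address = address
        self.transfer_count = count
        self.buffer.clear()
        self.done = False
        return True

    def clock(self, rx_data_valid: bool, data: int, rx_eot: bool) -> None:
        # Step 836: latch one word for each cycle with Rx_DATA_VALID asserted.
        if rx_data_valid:
            self.buffer.append(data)
        # Step 840: Rx_EOT marks the end of the transfer on bus 407D;
        # in the text this is what starts the Receive CPU state machine.
        if rx_eot:
            self.done = True
```

Calling `start(True, address, count)` and then `clock(...)` once per bus cycle fills the buffer in arrival order; the `done` flag plays the role of the signal that hands control to the CPU-bus side.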

The flow for the Receive CPU state machine is explained
below. Thus, referring to FIG. 15B, the end of the transfer
on the message engine data bus 407D portion of bus 407
starts the Receive CPU state machine 462 and puts it in Step
842. The Receive CPU state machine 462 checks for validity
of the address in this state (Step 844). This is done by the
address check circuitry 456. If the address loaded in the
address register 452 is outside the range of the receive queue
312R in the RAM 312, the transfer is aborted, the status
is captured in the Receive status register 459, and the Receive
CPU state machine 462 proceeds to Step 845. On a valid
address the Receive CPU state machine 462 goes to Step
846. In Step 846 the Receive CPU state machine 462 requests
access to the CPU bus 317. It then proceeds to Step 848.
In Step 848 it checks for a grant on the bus 317. On a
qualified grant it proceeds to Step 850. In Step 850, the Receive
CPU state machine 462 performs an address and a data cycle,
which essentially writes the data in the buffer registers 460
into the receive queue 312R in RAM 312. Simultaneously
with the write to the RAM 312, the data put on the CPU bus
317 is also loaded into the duplicate buffer register 466. At the
same time, the CPU 310 also indicates on one of the control
lines whether the data corresponding to the address written to
in the RAM 312 is available in its local cache memory 319. At
the end of the address and data cycle the Receive CPU state
machine 462 proceeds to Step 850. In this step it checks for
cache coherency errors of the type described above in
connection with the transmit DMA 418 (FIG. 9). If a cache
coherency error is detected, the Receive CPU state
machine 462 proceeds to Step 846 and retries the transaction.
More particularly, the Receive CPU state machine 462 now
generates another address and data cycle to the previous
address, and this time the data from the duplicate buffer 466
is put on to the CPU data bus 317. If there were no cache
coherency errors, the Receive CPU state machine 462 proceeds
to Step 852, where it decrements the transfer counter
454 and increments the address in the address register 452.
The Receive CPU state machine 462 then proceeds to Step
854. In Step 854, the state machine 462 checks if the transfer
counter has expired, i.e., is zero. On a non-zero transfer
count the Receive CPU state machine 462 proceeds to Step
844 and repeats the above described procedure until the
transfer count becomes zero. A zero transfer count in Step
854 completes the write into the receive queue 312R in
RAM 312, and the Receive CPU state machine proceeds to
Step 845. In Step 845, the status stored in the status register
is conveyed back to the message engine 315 state
machine 410.
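
The loop of FIG. 15B can be illustrated with a short software sketch. It is a simplified model under stated assumptions: the function name `receive_cpu_transfer`, the dictionary standing in for RAM 312, the dictionary standing in for the local cache memory 319, and the returned status strings are all hypothetical, and bus arbitration (Steps 846 and 848) is omitted.

```python
def receive_cpu_transfer(ram, queue_range, address, words, cache):
    """Write `words` into `ram` starting at `address` (illustrative only).

    ram:         dict address -> word, stands in for RAM 312
    queue_range: range of addresses forming the receive queue 312R
    cache:       dict address -> old word, stands in for local cache 319
    Returns a status string, standing in for status register 459.
    """
    count = len(words)              # transfer counter 454
    index = 0
    while count > 0:                # Step 854: repeat until the count is zero
        # Step 844: abort if the address is outside the receive queue.
        if address not in queue_range:
            return "address_error"
        word = words[index]
        # Step 850: address and data cycle; the word placed on the bus is
        # also captured in the duplicate buffer register.
        ram[address] = word
        duplicate = word
        if address in cache:
            # Coherency case: the CPU first copies its old cached data back
            # to RAM (assumed here to evict the line), then the cycle is
            # retried with the copy held in the duplicate buffer.
            ram[address] = cache.pop(address)
            ram[address] = duplicate
        # Step 852: decrement the counter and increment the address.
        count -= 1
        address += 1
        index += 1
    return "ok"
```

The point of the duplicate copy is visible in the coherency branch: after the CPU's write-back has clobbered the location with old data, the new data is still available for the retry without going back to the message engine.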

Referring again to FIG. 7, the interrupt control status
register 412 will be described in more detail. As described
above, a packet is sent by the packetizer portion of the
packetizer/de-packetizer 428 to the crossbar switch 320 for
transmission to one or more of the directors. It is to be noted
that the packet sent by the packetizer portion of the
packetizer/de-packetizer 428 passes through a parity generator
PG in the message engine 315 prior to passing to the
crossbar switch 320. When such packet is sent by the
message engine 315 in exemplary director 180₁ to the
crossbar switch 320, a parity bit is added to the packet by
parity bit generator PG prior to passing to the crossbar
switch 320. The parity of the packet is checked in the parity
checker portion of a parity checker/generator (PG/C) in the
crossbar switch 320. The result of the check is sent by the
PG/C in the crossbar switch 320 to the interrupt control
status register 412 in the director 180₁.

Likewise, when a packet is transmitted from the crossbar
switch 320 to the message engine 315 of exemplary director
180₁, the packet passes through a parity generator portion of
the parity checker/generator (PG/C) in the crossbar switch
320 prior to being transmitted to the message engine 315 in
director 180₁. The parity of the packet is then checked in the
parity checker portion of the parity checker (PC) in director
180₁ and the result (i.e., status) is transmitted to the status
register 412.
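
The generate-then-check pattern described in the two paragraphs above can be sketched as follows. The text does not say whether even or odd parity is used or how wide the packet is, so even parity over a byte string is assumed here, and the function names are hypothetical.

```python
def parity_bit(packet: bytes) -> int:
    """Even-parity bit (assumed): 1 if the packet holds an odd number of 1 bits."""
    ones = sum(bin(byte).count("1") for byte in packet)
    return ones & 1


def send(packet: bytes):
    # Sender side (parity generator): attach the parity bit to the packet.
    return packet, parity_bit(packet)


def check(packet: bytes, parity: int) -> bool:
    # Receiver side (parity checker): recompute and compare. The boolean
    # result is what would be reported to the status register.
    return parity_bit(packet) == parity
```

A single flipped bit anywhere in the packet changes the recomputed parity, so `check` returns `False`; an even number of flipped bits would go undetected, which is the usual limitation of a single parity bit.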

A number of embodiments of the invention have been 
described. Nevertheless, it will be understood that various 
modifications may be made without departing from the spirit
and scope of the invention. Accordingly, other embodiments 
are within the scope of the following claims. 
What is claimed is: 

1. A system interface comprising:
a plurality of first directors;
a plurality of second directors;

a data transfer section having a cache memory, such cache
memory being coupled to the plurality of first and
second directors;

a messaging network, operative independently of the data
transfer section, coupled to the plurality of first directors
and the plurality of second directors; and

wherein the first and second directors control data transfer
between the first directors and the second directors in
response to messages passing between the first directors
and the second directors through the messaging
network to facilitate data transfer between first directors
and the second directors with such data passing
through the cache memory in the data transfer section;

wherein each one of the first directors includes: 

a data pipe coupled between an input of such one of the 

first directors and the cache memory; 
a microprocessor; 
a controller; 

a common bus, such bus interconnecting the data pipe, 
the microprocessor, and the controller; and wherein 

the controller controls the transfer of the messages 
between the message network and such one of the 
first directors and the data between the input of such 
one of the first directors and the cache memory. 

2. The system interface recited in claim 1 wherein each 
one of the second directors includes: 

a data pipe coupled between an input of such one of the

second directors and the cache memory; 
a microprocessor; 
a controller; 

a common bus, such bus interconnecting the data pipe, the 
microprocessor, and the controller; and wherein 
the controller controls the transfer of the messages 
between the message network and such one of the 
second directors and the data between the input of 
such one of the second directors and the cache 
memory. 

3. The system interface recited in claim 1 wherein each 
one of the controllers includes a bus arbiter coupled to the
common bus for arbitrating access to such common bus. 

4. The system recited in claim 2 wherein each one of the 
controllers in the second directors includes a bus arbiter 
coupled to the common bus for arbitrating access to such 
common bus. 

5. A data storage system for transferring data between a 
host computer/server and a bank of disk drives through a
system interface, such system interface comprising:

a plurality of first directors coupled to host computer/ 
server; 

a plurality of second directors coupled to the bank of disk
drives; 

a data transfer section having a cache memory, such
cache memory being coupled to the plurality of first 
and second directors; 

a messaging network, operative independently of the 
data transfer section, coupled to the plurality of first 
directors and the plurality of second directors; and

wherein the first and second directors control data 
transfer between the host computer and the bank of 
disk drives in response to messages passing between
the first directors and the second directors through
the messaging network to facilitate the data transfer
between host computer/server and the bank of disk
drives with such data passing through the cache
memory in the data transfer section;
wherein each one of the first directors includes:
a data pipe coupled between an input of such one of
the first directors and the cache memory;
a microprocessor;
a controller;

a common bus, such bus interconnecting the data
pipe, the microprocessor, and the controller; and
wherein

the controller controls the transfer of the messages
between the message network and such one of the
first directors and the data between the input of
such one of the first directors and the cache
memory.

6. The storage system recited in claim 5 wherein each one
of the second directors includes: 

a data pipe coupled between an input of such one of the 

second directors and the cache memory; 
a microprocessor; 
a controller;

a common bus, such bus interconnecting the data pipe, the
microprocessor, and the controller; and wherein
the controller controls the transfer of the messages
between the message network and such one of the
second directors and the data between the input of 
such one of the second directors and the cache 
memory. 

7. The storage system recited in claim 5 wherein each one
of the controllers includes a bus arbiter coupled to the
common bus for arbitrating access to such common bus.

8. The storage system recited in claim 6 wherein each one
of the controllers in the second directors includes a bus arbiter
coupled to the common bus for arbitrating access to such
common bus.

9. A method of operating a system interface having a 
plurality of first directors, a plurality of second directors and 
a data transfer section having a cache memory, such cache 
memory being coupled to the plurality of first and second 
directors, such method comprising: 



providing a messaging network, operative independently 
of the data transfer section, coupled to the plurality of 
first directors and the plurality of second directors to
control data transfer between the first directors and the
second directors in response to messages passing 
between the first directors and the second directors 
through the messaging network to facilitate data trans- 
fer between first directors and the second directors with 
such data passing through the cache memory in the data 
transfer section; and providing each one of the first 
directors with: 

a data pipe coupled between an input of such one of the 

first directors and the cache memory; 
a microprocessor; 
a controller; 

a common bus, such bus interconnecting the data pipe, 
the microprocessor, and the controller; and wherein 

the controller controls the transfer of the messages 
between the message network and such one of the
first directors and the data between the input of such 
one of the first directors and the cache memory. 

10. The method recited in claim 9 including providing 
each one of the second directors with: 

a data pipe coupled between an input of such one of the 

second directors and the cache memory; 
a microprocessor; 
a controller; 

a common bus, such bus interconnecting the data pipe, the 
microprocessor, and the controller; and wherein 
the controller controls the transfer of the messages 
between the message network and such one of the 
second directors and the data between the input of 
such one of the second directors and the cache 
memory. 

11. The method recited in claim 9 including providing
each one of the controllers with a bus arbiter coupled to the 
common bus for arbitrating access to such common bus. 

12. The method recited in claim 10 including providing 
each one of the controllers in the second directors with a bus
arbiter coupled to the common bus for arbitrating access to 
such common bus. 
second level controllers. The first level controllers and
the second level controllers work together such that if
one of the second level controllers fails, the routing
[...]tion of the memory devices remains constant. The invention
also includes switching circuitry which permits
a functioning second level controller to assume control 
of a group of memory devices formerly primarily con- 
trolled by the failed second level controller. In addition, 
the invention provides error check and correction as 
well as mass storage device configuration circuitry. 
CROSS REFERENCE TO RELATED APPLICATIONS

This is a continuation of application Ser. No. 07/601,482
filed Oct. 22, 1990, which is a continuation-in-part of Ser.
Nos. 07/505,622, 07/506,703, and 07/488,749, filed Apr. 6,
1990, Apr. 6, 1990, and Mar. 2, 1990, respectively.

BACKGROUND OF THE INVENTION

The present invention relates generally to memory storage
devices. More particularly, the invention is a method and
apparatus for interfacing an external computer to a set of
storage devices which are typically disk drives.

Magnetic disk drive memories for use with digital computer
systems are known. Although many types of disk drives are
known, the present invention will be described as using hard
disk drives. However, nothing herein should be taken to limit
the invention to that particular embodiment.

Many computer systems use a plurality of disk drive memories
to store data. A common known architecture for such systems
is shown in FIG. 1. Therein, computer 10 is coupled by means
of bus 15 to disk array 20. Disk array 20 is comprised of large
buffer 22, bus 24, and a plurality of disk drives 30. The disk
drives 30 can be operated in various logical configurations.
When a group of drives is operated collectively as a logical
device, data stored during a write operation can be spread
across one or more members of an array. Disk controllers 35
are connected to buffer 22 by bus 24. Each controller 35 is
assigned a particular disk drive 30.

Each disk drive within disk drive array 20 is accessed and the
data thereon retrieved individually. The disk controller 35
associated with each disk drive 30 controls the input/output
operations for the particular disk drive to which it is coupled.
Data placed in buffer 22 is available for transmission to
computer 10 over bus 15. When the computer transmits data to
be written on the disks, controllers 35 receive the data for the
individual disk drives from bus 24. In this type of system, disk
operations are asynchronous in relationship to each other.

In the case where one of the controllers experiences a failure,
the computer must take action to isolate the failed controller
and to switch the memory devices formerly under the failed
controller's control to a properly functioning other controller.
The switching requires the computer to perform a number of
operations. First, it must isolate the failed controller. This
means that all data flow directed to the failed controller must
be redirected to a working controller.

In the system described above, it is necessary for the computer
to be involved with rerouting data away from a failed
controller. The necessary operations performed by the computer
in completing rerouting require the computer's attention. This
places added functions on the computer which may delay other
functions which the computer is working on. As a result, the
entire system is slowed down.

Another problem associated with disk operations, in particular
writing and reading, is an associated probability of error.
Procedures and apparatus have been developed which can detect
and, in some cases, correct the errors which occur during the
reading and writing of the disks. With relation to a generic disk
drive, the disk is divided into a plurality of sectors, each sector
having the same, predetermined size. Each sector has a
particular [...] address, a header field code which allows for the
detection of errors in the header field, a data field of variable
length and ECC ("Error Correction Code") codes, which allow
for the detection and correction of errors in the data.

When a disk is written to, the disk controller reads the header
field and the header field code. If the sector is the desired
sector and no header field error is detected, the new data is
written into the data field and the new data ECC is written into
the ECC field.

Read operations are similar in that initially both the header
field and header field error code are read. If no header field
errors exist, the data and the data correction codes are read. If
no error is detected the data is transmitted to the computer. If
errors are detected, the error correction circuitry located within
the disk controller tries to correct the error. If this is possible,
the corrected data is transmitted. Otherwise, the disk drive's
controller signals to the computer or master disk controller that
an uncorrectable error has been detected.

In FIG. 2 a known disk drive system which has an associated
error correction circuit, external to the individual disk
controllers, is shown. This system uses a Reed-Solomon error
detection code both to detect and correct errors. Reed-Solomon
codes are known and the information required to generate them
is described in many references. One such reference is
Practical Error Correction Design for Engineers, published by
Data Systems Technology Corp., Broomfield, Colo. For purposes
of this application, it is necessary to know that the
Reed-Solomon code generates redundancy terms, herein called P
and Q redundancy terms, which terms are used to detect and
correct data errors.

In the system shown in FIG. 2, ECC unit 42 is coupled to bus
45. The bus is individually coupled to a plurality of data disk
drives, numbered here 47, 48, and 49, as well as to the P and Q
term disk drives, numbered 51 and 53, through Small Computer
Standard Interfaces ("SCSIs") 54 through 58. The American
National Standard for Information Processing ("ANSI") has
promulgated a standard for SCSI which is described in ANSI
document number X3.130-1986.

Bus 45 is additionally coupled to large output buffer 22. Buffer
22 is in turn coupled to computer 10. In this system, as blocks
of data are read from the individual data disk drives, they are
individually and sequentially placed on the bus and
simultaneously transmitted both to the large buffer and the ECC
unit. The P and Q terms from disk drives 51 and 53 are
transmitted to ECC 42 only. The transmission of data and the P
and Q terms over bus 45 occurs sequentially. The exact bus
width can be any arbitrary size, with 8-, 16- and 32-bit wide
buses being common.

After a large block of data is assembled in the buffer, the
calculations necessary to detect and correct data errors, which
use the terms received from the P and Q disk drives, are
performed within the ECC unit 42. If errors are detected, the
transfer of data to the computer is interrupted and the incorrect
data is corrected if possible.

During write operations, after a block of data is assembled in
buffer 22, new P and Q terms are generated within ECC unit 42
and written to the P and Q disk
drives at the same time that the data in buffer 22 is written to
the data disk drives.

Those disk drive systems which utilize known error correction
techniques have several shortcomings. In the systems
illustrated in FIGS. 1 and 2, data transmission is sequential
over a single bus with a relatively slow rate of data transfer.
Additionally, as the error correction circuitry must wait until a
block of data of predefined size is assembled in the buffer
before it can detect and correct errors therein, there is an
unavoidable delay while such detection and correction takes
place.

As stated, the most common form of data transmission in these
systems is serial data transmission. Given that the bus has a
fixed width, it takes a fixed and relatively large amount of time
to build up data in the buffer for transmission either to the
disks or computer. If the large, single buffer fails, all the disk
drives coupled thereto become unusable. Therefore, a system
which has a plurality of disk drives which can increase the rate
of data transfer between the computer and the disk drives and
more effectively match the data transfer rate to the computer's
maximum efficient operating speed is desirable. The system
should also be able to conduct this high rate of data transfer
while performing all necessary error detection and correction
functions and at the same time provide an acceptable level of
performance even when individual disk drives fail.

Another failing of prior art systems is that they do not exploit
the full range of data organizations that are possible in a
system using a group of disk drive arrays. In other words, a
mass storage apparatus made up of a plurality of physical
storage devices may be called upon to operate as a logical
storage device for two concurrently-running applications having
different data storage needs. For example, one application
requiring large data transfers (i.e., high bandwidth), and the
other requiring high frequency transfers (i.e., high operation
rate). A third application may call upon the apparatus to
provide both high bandwidth and high operating rate. Known
operating techniques for physical device sets do not provide the
capability of dynamically configuring a single set of physical
storage devices to provide optimal service in response to such
varied needs.

It would therefore be desirable to be able to provide a mass
storage apparatus, made up of a plurality of physical storage
devices, which could flexibly provide both high bandwidth and
high operation rate, as necessary, along with high reliability.

SUMMARY OF THE INVENTION

The present invention provides a set of small, inexpensive disk
drives that appears to an external computer as one or more
logical disk drives. The disk drives are arranged in sets. Data is
broken up and written across the disk drives in a set, with error
detection and correction redundancy data being generated in
the process and also being written to a redundancy area.
Backup disk drives are provided which can be coupled into one
or more of the sets. Multiple control systems for the sets are
used, with any one set having a primary control system and
another control system which acts as its backup, but primarily
controls a separate set. The error correction or redundancy
data and the error detection data is generated "on the fly" as
data is transferred to the disk drives. When data is read from
the disk drives, the error detection data is verified to confirm
the integrity of the data. Lost data from a particular disk drive
can be regenerated with the use of the redundancy data and a
backup drive can be substituted for a failed disk drive.

The present invention provides an arrangement of disk drive
controllers, data disk drives and error correction code disk
drives, the drives being each individually coupled to a small
buffer memory and a circuit for error detection and correction.
A first aspect of the present invention is error detection and
correction which occurs nearly simultaneously with the transfer
of data to and from the disk drives. The multiple buffer
memories can then be read from or written to in sequence for
transfers on a data bus to the system computer. Additionally,
the error correction circuitry can be connected to all of the
buffer memory/disk drive data paths through a series of
multiplexer circuits called cross-bar ("X-bar") switches. These
X-bar switches can be used to decouple failed buffer memories
or disk drives from the system.

A number of disk drives are operatively interconnected so as to
function at a first logical level as one or more logical
redundancy groups. A logical redundancy group is a set of disk
drives which share redundancy data. The width, depth and
redundancy type (e.g., mirrored data or check data) of each
logical redundancy group, and the location of redundant
information therein, are independently configurable to meet
desired capacity and reliability requirements. At a second
logical level, blocks of mass storage data are grouped into one
or more logical data groups. A logical redundancy group may be
divided into more than one such data group. The width, depth,
addressing sequence and arrangement of data blocks in each
logical data group are independently configurable to divide the
mass data storage apparatus into multiple logical mass storage
areas each having potentially different bandwidth and operation
rate characteristics.

A third logical level, for interacting with application software
of a host computer operating system, is also provided. The
application level superimposes logical application units on the
data groups to allow data groups, alone or in combination from
one or more redundancy groups, to appear to application
software as single logical storage units.

As data is written to the drives, the error correction circuit,
herein called the Array Correction Circuit ("ACC"), calculates
P and Q redundancy terms and stores them on two designated P
and Q disk drives through the X-bar switches. In contrast to the
discussed prior art, the present invention's ACC can detect and
correct errors across an entire set of disk drives
simultaneously, hence the use of the term "Array Correction
Circuit." In the following description, the term ACC will refer
only to the circuit which performs the necessary error
correction functions. The codes themselves will be referred to
as Error Correction Code or "ECC". On subsequent read
operations, the ACC may compare the data read with the stored
P and Q values to determine if the data is error-free.

The X-bar switches have several internal registers. As data is
transmitted to and from the data disk drives, it must go through
an X-bar switch. Within the X-bar switch the data can be
clocked from one register to the next before going to the buffer
or the disk drive. The time it takes to clock the data through
the X-bar internal registers is sufficient to allow the ACC to
calculate and perform its error correction tasks. During a write
operation, this arrangement allows the P and Q values to be
generated and written to their designated disk drives at
the same time as the data is written to its disk drives, the
operations occurring in parallel. In effect, the X-bar switches
establish a data pipeline of several stages, the plurality of
stages effectively providing a time delay [...]

In one preferred embodiment, two ACC units are provided. Both
ACCs can be used simultaneously on two operations that access
different disk drives, or one can be used if the other fails.

The X-bar switch arrangement also provides flexibility in the
data paths. Under control of the system controller, a
malfunctioning disk drive can be decoupled from the system by
reconfiguring the appropriate X-bar switch or switches, and the
data that was to be stored on the failed disk can be rerouted to
another data disk drive. As the system computer is not involved
in the detection or correction of data errors, or in reconfiguring
the system in the case of failed drives or buffers, these
processes are said to be transparent to the system computer.

In a first embodiment of the present invention, a plurality of
X-bar switches are coupled to a plurality of disk drives and
buffers, each X-bar switch having at least one data path to each
buffer and each disk drive. In operation a failure of any buffer
or disk drive may be compensated for by rerouting the data flow
through an X-bar switch to any operational drive or buffer. In
this embodiment full performance can be maintained when disk
drives fail.

In another embodiment of the present invention, two ACC
circuits are provided. In certain operating modes, such as when
all the disk drives are being written to or read from
simultaneously, the two ACC circuits are redundant, each ACC
acting as a back-up unit to the other. In other modes, such as
when data is written to an individual disk drive, the two ACCs
work in parallel, the first ACC performing a given action for a
portion of the entire set of drives, while the second ACC
performs a given action which is not necessarily the same for a
remaining portion of the set.

In yet another embodiment, the ACC performs certain
self-monitoring check operations using the P and Q redundancy
terms to determine if the ACC itself is functioning properly. If
these check operations fail, the ACC will indicate its failure to
the control system, and it will not be used in any other
operations.

In still another embodiment, the ACC unit is coupled to all the
disk drives in the set and data being transmitted to or from the
disk drives is simultaneously recovered by the ACC. The ACC
performs either error detection or error correction upon the
transmitted data in parallel with the data transmitted from the
buffers and the disk drives.

The present invention provides a speed advantage over the prior
art by maximizing the use of parallel paths to the disk drives.
Redundancy and thus fault-tolerance is also provided by the
described arrangement of the X-bar switches and ACC units.

Another aspect of the present invention is that it switches
control of disk drive sets when a particular [...]

[...] controllers and the first level controllers can also
communicate between themselves. In a preferred embodiment,
the system is configured such that the second level controllers
are grouped in pairs. [...] configuration provides [...]
procedures for the flow of [...] drives. For ease of
understanding, the [...]. Of course, it should be [...] second
level controllers [...] groupings. [...] is implemented to
connect each of the second level controllers to a group of disk
drives. [...] that a second level controller should fail, the
computer need not get involved with the rerouting of data to the
disk drives. Instead, the first level controllers and the properly
working second level controller can [...] without the
involvement of the computer. This allows the logical
configuration of the disk drives to remain constant from the
perspective of the computer despite a change in the physical
configuration.

There are two levels of severity of failures which can arise in
the second level controllers. The first type is a complete
failure. In the case of a complete failure, the second level
controller stops communicating with the first level controllers
and the other second level controller. The first level controllers
are informed of the failure by the properly working second level
controller or may recognize this failure when trying to route
data to the failed second level controller. In either case, the
first level controller will switch data paths from the failed
second level controller to the properly functioning second level
controller. Once this rerouted path has been established, the
properly functioning second level controller issues a command
to the malfunctioning second level controller to release control
of its disk drives. The properly functioning second level
controller then assumes control of these disk drive sets.

The second type of failure is a controlled failure where the
failed controller can continue to communicate with the rest of
the system. The partner second level controller is informed of
the malfunction. The properly functioning second level
controller then informs the first level controllers to switch data
paths to the functioning second level controller. Next, the
malfunctioning second level controller releases its control of
the disk drives and the functioning second level controller
assumes control. Finally, the properly functioning second level
controller checks and, if necessary, corrects data written to the
drives by the malfunctioning second level controller.

A further aspect of the present invention is a SCSI bus
switching function which permits the second level controllers to
release and assume control of the disk drives.

For a more complete understanding of the nature and the
advantages of the invention, reference should be
controller fails Switching is performed in a manner made to the ensuing detail description taken in con iunc- 
transparent to the computer. tion with the accompanying drawings. 

The controllers comprise a plurality of first level 
controllers each connected to the computer. Connected BRIEF DESCRIPTION OF THE DRAWINGS 
to the other side of the first level controllers is a set of 65 FIG. 1 is a block diagram illustrating a prior art disk 
second level controllers. Each first level controller can array system* 

route data to any one of the second level controllers. FIG. 2 is a block diagram illustrating a prior art disk 
Communication buses tie together the second level array system with an error check and Correction block; 
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«H G V a 1 * 3 di8t if ain """"J 1 ™* a Preferred «nbodi- ing. bul not limited to. floppy disks, magnetic tape 

mem fihe verall system of the present inventi n: drives, and optica) disks. 

FIG. 3.5 is a diagram showing a pair of disk drive sets 

18A and 18B connected to a pair of second level con- Overall System Environment 

IroHers 14A and 14B. 5 One preferred embodiment of the present invention 

riu. 4 is a diagram showing a m re detailed illustra- operates in the environment sh wn in FIG. 3. In FIG 3 

tion of FIG. 3 including the interconnections f the computer 10 communicates with a group of disk drive 

switches and the disk drives within the disk drive sets; sets 18 through controller 11. In a preferred erobodi- 

FrG. 5 is a block diagram of the wiring between the ment, controller 11 includes a number of components 
controllers and the switches; 10 which permit computer 10 to access each of disk drive 

FIG. 6 is a block diagram showing the schematic set* M even when there is a failure in one of the compo- 

circuitry of the switching function control circuitry nente of controller 11. As shown m FIG. 3, controller 

shown in FIG. 5; 11 includes a pair of two-level devices 13. Within each 

FIG. 7 is a recovery state transition diagram illustrat- of the two-level devices 13 is a first level controller 12 
ing the various possible steles of a particular second 15 and 8 second level controller 14. A switch 16 which 

level controller; comprises a group of switches permits computer 10 to 

FIGS. 8A-81 show the events which take place dur- access disk drive sets 18 through more than one path. In 

ing the transition between each of the states shown in th . is wa y> f either of two-level devices 13 experience a 

FIG. 7; failure in one of their components, the path may be 

FIG. 9 is a block diagram of one preferred embodi- 20 re-routed without computer 10 being interrupted, 
ment of the X-bar circuitry; FIG* 3.5 is a diagram showing a pair of disk drive sets 
FIG. 10 is a block diagram of a preferred embodiment 18A and 188 connected to a pair of second level con- 
of the error check and correction circuitry; trollers 14A and 14B. Controllers 14A and 14B each 
FIG. 11 is a detailed block diagram of the X-bar include two interface modules 27 for interfacing second 
switches and the ACC shown in FIG. 10; 15 ,eve ' controllers 14 with a pair of first level controllers 
FIGS. 12a and 126 show the logic operations neces- 12 ^ shown in P 10 - Interface modules 27 are con- 
sary to calculate the P and Q error detection terms- nected to buffers 330-335 which buffer data to be trans- 
FIGS. 13o and 13b show how the Reed-Solomon J" 1 "? 1 to and received from the disk drives. Second 
codeword is formed and stored in one embodiment of controllers 14 are configured to be primarily re- 
tire present invention; 30 sponsible for one group of disk drives and secondarily 
FIGS. 14a and Ub show the parity detector and responsible for a second group of disk drives. As shown, 
parity generator circuits in the ACC- second level controller 14A is primarily responsible for 
FIGS. 15. 16, 17, and 18 show, respectively, the data d ? Ves J0A1 .' 20 "' 20C1 - 2001 2 °E1 and sec- 
flow during a Transaction Mode Normal Read, a Trans- „ MnT^nH^", a ford " kdri v«*»A2,20B2. 20C2. 
action Mode Failed Drive Read, a Transaction Mode i, ^ A „ spare dnve 20X ,$ snared bv *>* 
Read-Modify-Write Read and a Transaction Mode T%5?££ ^"^"e^js activated to take over for 
Read-Modify-Write Write; 8 ** d " ve J wh,cn has fa,led - 

FIG. 19 is a schematic diagram of a set of disk drives t JJl 2 ^V.f TE?* '° se J? )nd |f vel c °n- 

in which check data is distributed among drives of the 40 KSJ?-£^? « *. ,nterfaces 31 These 

set according to a known technique- 40 ' n,erfaces ™ «« b > controllers 14 to configure the disk 

FIG. 20 is a schematic diagram of a mass storage j££ 2 ^, l ^^ l ST^^i to ^^J idl 

system suitable for use with the present invention; io^f JSlim ' ^ " nd , MA2, MB2> 

FIG. 21 is a schematic diagram of the distribution of 3S^Se?2o£ andinpT ^".l" W 

data on the surface of a magnetic disk- « . Sk • VCS 2 ° El ff" 1 20E2 mav set to store 

FIG. 22 is a schematic diagram of the distribution of " loXis^Z^uJI 2? ^ 

data in a first preferred embodiment of a redundancy 3?£Sn atll jZZSfZF' 

group according to the present invention; lariet^o^fl-r^oT 8 8V * Wde 

FIG. 23 is a schematic diagram of the distribution of ^KS 4 iTmore detaillrf diagram t. • .k • 

data in a second, more particularly preferred embodi- J0 condor ~ tailed duigram showing the inter. 

Kit redundancy 8 ' 0up accoi ,0 *" *~* - 

nG.iiisadiagramshowinghowthememoryspace t^iS^^'Sj^S^'S 
of a device set might be configured in accordance with Second level controllers 14A and 1« are show„ con 

,h FI^°d^^ 

P resenl invention. A ows M wcI1 M conuo , Md sutus ^ 

DETAILED DESCRIPTION OF THE ,lne ^tween second level controller 14A and second 

DRAWINGS 60 ,cvel controller 14B represents a communication line 

The nrefrrrpH ^nrwiim^c «r .1. . • through which the second level controllers communi- 

i ne preterred embodiments of the present invention cate with each other 

fS^Snli 0 *^ -t^'t S, ° rage K ,n Pre j ScC0nd lcvel strollers 14 are each connected to a 
ferred embodimenu described herein, the preferred group of disk drive sets 18A-18F throueh switches 
devices for storing data are hard disk drives, referenced 65 16A-16F tnrougn switches 

^, B . disk dr ? ves No ! hi "6 hM «n should be under- Disk drives 20 are arranged in a manner so that each 
stood o limit this invention to using disk drives only. second level controller 14 is primaril^SSe for 
Any other device for stonng data may be used, includ- one group f disk drive sets. E„ in^HG 4 sec 
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ond level controller 14A may be primarily responsible 
for three of the disk drive sets 18A-18F. Similarly, 
second level controller 14B may be primarily responsi- 
ble for the remaining three disk drive sets 18A-18F. 
Second level controllers 14 are secondarily responsible
for the disk drives primarily controlled by the partner
second level controller. In the particular arrangement
shown in FIG. 4, second level controller 14A may be
primarily responsible for the left three disk drive sets
18A, 18B and 18C and secondarily responsible for the
right three disk drive sets 18D, 18E and 18F. Second
level controller 14B is primarily responsible for the
right three disk drive sets 18D, 18E and 18F and
secondarily responsible for the left three disk drive sets
18A, 18B and 18C.
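
The primary and secondary responsibilities described for FIG. 4 can be pictured as a small lookup table. The following is only an illustrative sketch in Python, not anything taken from the described controller; the names PRIMARY, PARTNER, secondary_sets and owner_after_failure are hypothetical, and the split of drive sets follows the "left three / right three" arrangement given above.

```python
# Hypothetical model of the FIG. 4 arrangement; all names are illustrative.
PRIMARY = {
    "14A": ["18A", "18B", "18C"],  # left three disk drive sets
    "14B": ["18D", "18E", "18F"],  # right three disk drive sets
}
PARTNER = {"14A": "14B", "14B": "14A"}


def secondary_sets(controller):
    """A controller's secondary sets are its partner's primary sets."""
    return PRIMARY[PARTNER[controller]]


def owner_after_failure(drive_set, failed):
    """Return the controller serving drive_set once `failed` has gone down."""
    for controller, sets in PRIMARY.items():
        if drive_set in sets:
            # The partner takes over only the sets of the failed controller.
            return PARTNER[controller] if controller == failed else controller
    raise KeyError(drive_set)
```

For example, under these assumptions owner_after_failure("18A", "14A") yields "14B", while a set already owned by 14B is unaffected by the failure of 14A.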

Each second level controller 14 contains a second
level controller recovery system (CRS) 22. CRS 22 is a
portion of software code which manages the communication
between second level controllers 14 and first
level controllers 12. CRS 22 is typically implemented as
a state machine which is in the form of microcode or
sequencing logic for moving second level controller 14
from state to state (described below). State changes are
triggered as different events occur and messages are
sent between the various components of the system.

An ECC block 15 is also included in each second
level controller 14. ECC block 15 contains circuitry for
checking and correcting errors in data which occur as
the data is passed between various components of the
system. This circuitry is described in more detail below.

FIG. 5 is a block diagram showing a more detailed
illustration of the interconnections between second
level controllers 14A and 14B and the disk drives. For
simplicity, only a single disk drive port is shown. More
disk drive ports are included in the system as shown in
FIGS. 3 and 4.

Second level controller 14A has a primary control/sense
line 50A for controlling its primary set of disk
drives. An alternate control/sense line 52A controls an
alternate set of disk drives. Of course, second level
controller 14B has a corresponding set of control/sense
lines. Data buses 54A (second level controller 14A) and
54B (second level controller 14B) carry the data to and
from disk drives 20. These data buses are typically in the
form of a SCSI bus.

A set of switches 16A-16F are used to grant control
of the disk drives to a particular second level controller.
For example, in FIG. 4, second level controller 14A has
primary responsibility for disk drives 20A-20C and
alternate control of disk drives 20D-20F. Second level
controller 14B has primary control of disk drives
20D-20F and alternate control of disk drives 20A-20C.
By changing the signals on control/sense lines 50 and
52, primary and secondary control can be altered.

FIG. 6 is a more detailed illustration of one of the
switches 16A-16F. A pair of pulse shapers 60A and 60B
receive the signals from the corresponding control/sense
lines 50A and 52B shown in FIG. 5. Pulse shapers
60 clean up the signals which may have lost clarity as
they were transmitted over the lines. Pulse shapers of
this type are well known in the art. The clarified
signals from pulse shapers 60 are then fed to the set and
reset pins of R/S latch 62. The Q and Q-bar outputs of latch
62 are sent to the enable lines of a pair of driver/receivers
64A and 64B. Driver/receivers 64A and 64B are
connected between the disk drives and second level
controllers 14A and 14B. Depending upon whether
primary control/sense line 52B or alternate control/sense
line 50A is active, the appropriate second level
controller will be in control at a particular time.
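
The behavior of the latch-based switch can be illustrated with a small software model. This is only a sketch under stated assumptions, not the circuit itself: the class RSLatch and the function controlling_controller are hypothetical names, and the mapping of the latch's true output to driver/receiver 64A (and of the complementary output to 64B) is assumed for illustration, since the text does not state which output enables which driver/receiver.

```python
class RSLatch:
    """Minimal set/reset latch: Q holds its value until set or reset is pulsed."""

    def __init__(self, q=True):
        self.q = q

    def pulse(self, set_line=False, reset_line=False):
        if set_line and reset_line:
            raise ValueError("set and reset must not be asserted together")
        if set_line:
            self.q = True
        elif reset_line:
            self.q = False
        # With neither line asserted the latch simply keeps its state.

    @property
    def q_bar(self):
        """Complementary output of the latch."""
        return not self.q


def controlling_controller(latch):
    # Assumption: Q enables driver/receiver 64A (controller 14A) and the
    # complementary output enables 64B (controller 14B).
    return "14A" if latch.q else "14B"
```

Because exactly one of the two outputs is true at any time, only one driver/receiver is enabled, which matches the requirement that a single second level controller be in control at a particular time.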

FIG. 7 is a state transition diagram showing the
relationships between the various states of CRS 22 (FIG. 3)
of a particular second level controller 14. Each second
level controller 14 must be in only one state at any
particular point in time. Initially, assuming that the
system is functioning properly and each second level
controller 14 is primarily responsible for half of the disk
drive sets 18 and secondarily responsible for half of the
disk drive sets 18, second level controller 14 is in a
PRIMARY STATE 26. While in PRIMARY STATE
26, two major events may happen to move a second
level controller 14 from PRIMARY STATE 26 to
another state. The first event is the failure of the
particular second level controller 14. If there is a failure,
second level controller 14 shifts from PRIMARY
STATE 26 to a NONE STATE 28. In the process of
doing so, it will pass through
RUN-DOWN-PRIMARIES-TO-NONE STATE 30.
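
As a reading aid, the states that the description names for FIG. 7 can be collected in one place. The sketch below is a minimal illustration in Python and is not part of the described microcode; the class name CrsState and the helper is_transitional are hypothetical, and each value is simply the reference numeral the text attaches to that state.

```python
from enum import Enum


class CrsState(Enum):
    """States of CRS 22, keyed by the reference numerals used in the text."""

    PRIMARY = 26
    NONE = 28
    RUN_DOWN_PRIMARIES_TO_NONE = 30
    BOTH = 32
    RUN_DOWN_SECONDARIES_TO_PRIMARIES = 34
    SECONDARY = 36
    RUN_DOWN_PRIMARIES_TO_SECONDARIES = 38
    RUN_DOWN_SECONDARIES_TO_NONE = 40
    RUN_DOWN_BOTH_TO_NONE = 42


def is_transitional(state):
    """Run-down states are passed through on the way to a stable state."""
    return state.name.startswith("RUN_DOWN")
```

A controller holds exactly one CrsState value at a time, which mirrors the rule that each second level controller must be in only one state at any particular point.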

There are two types of failures which are possible in 
second level controller 14. The first type of failure is a 
controlled failure. Further, there are two types of con- 
trolled failures. 

The first type of controlled failure is a directed con- 
trolled failure. This is not actually a failure but instead 
an instruction input from an outside source instructing a 
particular second level controller to shut down. This 
instruction may be received in second level controller 
14 from one of the following sources: An operator, 
through computer 10; a console 19 through a port 24 
(e.g. RS-232) on the first level controller; a diagnostic 
console 21 through a port 23 (e.g. RS-232) on the sec- 
ond level controller; or by software initiated during 
predictive maintenance. Typically, such an instruction 
is issued in the case where diagnostic testing of a second 
level controller is to be conducted. In a directed con- 
trolled failure, the second level controller finishes up 
any instructions it is currently involved with and refuses 
to accept any further instructions. The second level 
controller effects a "graceful" shut down by sending 
out messages to the partner second level controller that 
it will be shutting down. 

The second type of controlled failure is referred to as 
a moderate failure. In this case, the second level con- 
troller recognizes that it has a problem and can no 
longer function properly to provide services to the 
system. For example, the memory or drives associated 
with that second level controller may have malfunc- 
tioned. Therefore, even if the second level controller is 
properly functioning, it cannot adequately provide ser- 
vices to the system. It aborts any current instructions, 
refuses to accept any new instructions and sends a mes- 
sage to the partner second level controller that it is 
shutting down. In both controlled failures, the malfunc- 
tioning second level controller releases the set of disk 
drives over which it has control. These drives are then 
taken over by the partner second level controller. 

The second type of failure is a complete failure. In a
complete failure, the second level controller becomes
inoperable and cannot send messages or "clean up" its
currently pending instructions by aborting them. In
other words, the second level controller has lost its
ability to serve the system. It is up to one of the first
level controllers or the partner second level controller
to recognize the problem. The partner second level
controller then takes control of the drives controlled by
the malfunctioning second level controller. The routing
through the malfunctioning second level controller is
switched over to the partner second level controller.

In all of the above failures, the switching takes place 
without interruption to the operation of the computer. 
Second level controllers 14 and first level controllers 12 
handle the rerouting independently by communicating 
the failure among themselves. 
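
The three failure cases differ mainly in what the failing controller can still do for itself. The table below is a hedged summary sketch in Python; the names FailureBehavior, BEHAVIOR and detected_by_others are hypothetical, and the flags only restate the behavior described above for directed controlled, moderate and complete failures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureBehavior:
    finishes_pending: bool   # completes instructions already in progress
    aborts_pending: bool     # aborts instructions already in progress
    notifies_partner: bool   # can still message the partner controller


BEHAVIOR = {
    # Directed controlled failure: a requested, graceful shut down.
    "directed": FailureBehavior(True, False, True),
    # Moderate failure: controller detects its own problem and aborts work.
    "moderate": FailureBehavior(False, True, True),
    # Complete failure: controller can neither clean up nor send messages.
    "complete": FailureBehavior(False, False, False),
}


def detected_by_others(kind):
    """True when the failure must be noticed by another controller."""
    return not BEHAVIOR[kind].notifies_partner
```

Only the complete failure returns True from detected_by_others, reflecting that a first level controller or the partner second level controller has to recognize the problem in that case.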

Assuming there was a failure in second level controller
14A, second level controller 14A moves from PRIMARY
STATE 26 through a transition
RUN-DOWN-PRIMARIES-TO-NONE STATE 30 to NONE
STATE 28. At the same time, properly functioning
second level controller 14B moves from PRIMARY
STATE 26 to BOTH STATE 32. The basis for the
change in state of each of second level controllers 14A
and 14B is the failure of second level controller 14A.
When a second level controller fails, it is important to
switch disk drive control away from the failed second
level controller. This permits computer 10 to continue
to access disk drives which were formerly controlled by
a particular second level controller which has failed. In
the current example (FIG. 4), disk drive sets 18A-18C
are switched by switching functions 16A-16C so that
they are controlled by second level controller 14B.
Therefore, second level controller 14B is in BOTH
STATE 32 indicating that it has control of the disk
drive sets 18 for both second level controllers. Second
level controller 14A now controls none of the disk
drives and is in NONE STATE 28. The transition state 30
determines which of several possible transition paths is used.

If second level controller 14A is in NONE STATE 28 and second
level controller 14B is in BOTH STATE 32 there are a number of
options for transferring control of disk drive sets 18A-18F once
second level controller 14A has been repaired. First, second
level controller 14A and second level controller 14B could each
be shifted back to PRIMARY STATE 26. This is accomplished for
drive sets 18A-18C by second level controller 14A moving from
NONE STATE 28 directly to PRIMARY STATE 26 along the preempt p
line. Preempt p simply stands for "preempt primary" which means
that second level controller 14A preempts its primary drives or
takes control of them from second level controller 14B. At the
same time, second level controller 14B moves from BOTH STATE 32
through a transition RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34
and then to PRIMARY STATE 26.

A second alternative is for second level controller 14A to move
from NONE STATE 28 to SECONDARY STATE 36. Once in SECONDARY
STATE 36, second level controller 14A is in control of its
secondary disk drive sets 18D-18F. Second level controller 14B
concurrently moves from BOTH STATE 32 through
RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE 38 and on to SECONDARY
STATE 36. When both second level controllers are in SECONDARY
STATE 36, they are in control of their secondary disk drive sets.
Second level controller 14A controls disk drive sets 18D-18F and
second level controller 14B controls disk drive sets 18A-18C.

From SECONDARY STATE 36, a failing second level controller 14 may
move through RUN-DOWN-SECONDARIES-TO-NONE STATE 40 to NONE STATE
28. If this occurs, the properly functioning partner second level
controller 14 moves from SECONDARY STATE 36 to BOTH STATE 32 so
that computer 10 could access any one of disk drive sets 18. As
in the previous example, if second level controller 14A were to
fail it moves from SECONDARY STATE 36 through
RUN-DOWN-SECONDARIES-TO-NONE STATE 40 and into NONE STATE 28. At
the same time, properly functioning second level controller 14B
moves from SECONDARY STATE 36 along the preempt b/p line into
BOTH STATE 32. Preempt b/p stands for "preempt both/primaries."
In other words, all of the disk drives are preempted by the
properly functioning second level controller.

If, for all sets 18, second level controller 14A is in NONE STATE
28 and second level controller 14B is in BOTH STATE 32, it is
possible for second level controller 14A to take control of all
sets 18 of disk drives. This is desirable if second level
controller 14A were repaired and second level controller 14B
failed. Second level controller 14A moves from NONE STATE 28
along the preempt b line to BOTH STATE 32. At the same time,
second level controller 14B moves from BOTH STATE 32 through
RUN-DOWN-BOTH-TO-NONE STATE 42 and into NONE STATE 28. At this
point, second level controller 14A controls all disk drives while
second level controller 14B controls none of the disk drives.

Various failures may trigger the movement of second level
controllers 14 between states. Between states a number of events
take place. Each of these events is described in FIGS. 8A-8I. In
FIG. 8A, second level controller 14 is in PRIMARY STATE 26. There
are three different events which can take place while second
level controller 14 is in PRIMARY STATE 26. The first event is
for a preempt message 100 to be received from the partner second
level controller. At this point, the second level controller
receiving such a message will take the secondary path,
represented by block 102, and end up at BOTH STATE 32. The second
path which may be taken is triggered by receipt of a message 104
from CRS 22 of the other second level controller. This may be
some sort of communication which results in the second level
controller remaining in PRIMARY STATE 26. It will report and
return messages 106 to the other second level controller. The
final path which may be taken results in second level controller
ending up in RUN-DOWN-PRIMARIES-TO-NONE STATE 30. This path is
triggered upon receipt of a message 108 to release both sets of
drives or the primary disk drives. A timer is then set in block
110 and upon time out a message 112 is sent to the other second
level controller to take control of the primary set of disk
drives. Once in RUN-DOWN-PRIMARIES-TO-NONE STATE 30, second level
controller 14 will eventually end up in NONE STATE 28.

FIG. 8B illustrates various paths from
RUN-DOWN-PRIMARIES-TO-NONE STATE 30 to NONE STATE 28. Three
possible events may take place. First, a message 114 may be
received from another second level controller providing
communication information. In this case, second level controller
14 reports back messages 116 and remains in
RUN-DOWN-PRIMARIES-TO-NONE STATE 30. The second event which may
occur is for the timer, set during transition from PRIMARY STATE
26 to RUN-DOWN-PRIMARIES-TO-NONE STATE 30 to time out 118. If
this happens, second level controller 14 realizes that message
112 (FIG. 8A) didn't get properly sent and that there has been a
complete failure. It releases control of both its primaries and
secondary disk drives 122. It then
ends up in NONE STATE 28. The third event which
may occur while in RUN-DOWN-PRIMARIES-TO- 
NONE STATE 30 is for a response to be received 124 
from message 112 (FIG. 8A) sent out while second level 
controller moved from PRIMARY STATE 26 to
RUN-DOWN-PRIMARIES-TO-NONE STATE 30. 
This response indicates that the message was properly 
received. Second level controller 14 then releases its 
primary drives 126 and ends up in NONE STATE 28. 
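
The three outcomes described for FIG. 8B can be written as a single event handler. This is a minimal sketch, assuming string labels for the events and states; the function name run_down_primaries_to_none and the labels "communication", "timeout" and "response" are hypothetical and stand in for message 114, time out 118 and response 124 respectively.

```python
def run_down_primaries_to_none(event):
    """Return (next_state, drive_sets_released) for one event (FIG. 8B)."""
    if event == "communication":
        # Message 114: report back (116) and stay in the run-down state.
        return "RUN_DOWN_PRIMARIES_TO_NONE", []
    if event == "timeout":
        # Time out 118: message 112 was not properly sent, so this is
        # treated as a complete failure and both sets are released.
        return "NONE", ["primary", "secondary"]
    if event == "response":
        # Response 124: the partner received message 112, so only the
        # primary drives are released.
        return "NONE", ["primary"]
    raise ValueError(f"unexpected event: {event}")
```

The timer is what distinguishes an acknowledged hand-over from a silent one: without a response, the controller assumes the worst and gives up every drive set it holds.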

FIG. 8C covers the state transition between NONE
STATE 28 and one of either BOTH STATE 32, PRIMARY
STATE 26, or SECONDARY STATE 36.
When in NONE STATE 28, second level controller 14
can only receive messages. First, it may receive a
message 128 instructing it to preempt both its primary and
alternative sets of disk drives. It performs this function
130 and ends up in BOTH STATE 32. A second
possibility is for it to receive a preempt message 132
instructing it to preempt its primary set of drives. It performs
this instruction and ends up in PRIMARY STATE 26.
A third alternative is the receipt of a preempt message
136 instructing second level controller 14 to preempt its
secondary drives. Upon performance of this instruction
138 it ends up in SECONDARY STATE 36. Finally,
while in NONE STATE 28 second level controller 14
may receive communication messages 140 from its
partner second level controller. It reports back 142 to the
other second level controller and remains in NONE
STATE 28.
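
Because a controller in NONE STATE 28 can only receive messages, its behavior reduces to a lookup from message type to next state. The following is an illustrative sketch only; the dictionary NONE_TRANSITIONS, the function from_none and the message labels are hypothetical names chosen for this example.

```python
# Hypothetical message labels mapped to the state reached from NONE STATE 28.
NONE_TRANSITIONS = {
    "preempt_both": "BOTH",            # message 128
    "preempt_primary": "PRIMARY",      # message 132
    "preempt_secondary": "SECONDARY",  # message 136
    "communication": "NONE",           # messages 140: report back, no change
}


def from_none(message):
    """Return the state entered when `message` arrives in NONE STATE 28."""
    try:
        return NONE_TRANSITIONS[message]
    except KeyError:
        raise ValueError(f"message not accepted in NONE state: {message}")
```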

FIG. 8D illustrates the movement of second level
controller 14 from SECONDARY STATE 36 to
BOTH STATE 32 or RUN-DOWN-SECONDARIES-TO-NONE
STATE 40. While in SECONDARY
STATE 36, any one of three messages may be received
by second level controller 14. A first possibility is for a
preempt both or primary message 144 to be received. At
this point, second level controller 14 takes control of its
primary drives 146 and ends up in BOTH STATE 32. A
second possibility is for communication messages 148 to
be received from the partner controller. This results in
second level controller 14 reporting back 150 and
remaining in its present SECONDARY STATE 36.
Finally, a release both or secondary message 152 may be
received. Second level controller 14 sets a timer 154
upon receipt of this message. It then sends out a message
156 indicating it is now in
RUN-DOWN-SECONDARIES-TO-NONE STATE 40.

FIG. 8E shows the transition of second level controller
14 from RUN-DOWN-SECONDARIES-TO-NONE
STATE 40 to NONE STATE 28. Three different
messages may be received during
RUN-DOWN-SECONDARIES-TO-NONE STATE 40. First,
messages 158 from the partner second level controller may
be received. Second level controller 14 then reports
back (160) to its partner and remains in
RUN-DOWN-SECONDARIES-TO-NONE STATE 40. A second
possibility is for the timer, set between SECONDARY
STATE 36 and the present state, to time out (162). This
indicates that message 156 (FIG. 8D) was not properly
sent out and received by the partner second level
controller and that there has been a complete failure to
second level controller 14. Second level controller 14
then reports out (164) that it will release both of its sets
of disk drives 166. This results in it moving to NONE
STATE 28. Finally, second level controller 14 may
receive a response 168 to its message 156 (FIG. 8D) sent
after setting the timer between SECONDARY STATE
36 and RUN-DOWN-SECONDARIES-TO-NONE
STATE 40. Upon receiving this response, it releases its
secondary drives and ends up in NONE STATE 28.

FIG. 8F illustrates the various paths from BOTH
STATE 32 to any one of
RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE 38,
RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34 or
RUN-DOWN-BOTH-TO-NONE STATE 42. A first
possible message which may be received during BOTH
STATE 32 is a release primary message 172. This will
cause second level controller 14 to set a timer 174, send
a message 176 indicating it is running down primaries,
and wait in RUN-DOWN-PRIMARIES-TO-SECONDARIES
STATE 38. A second message which may be
received is a release secondaries message 180. Upon
receiving release secondaries message 180, second level
controller 14 sets a timer 182 and sends a message 184
indicating it has moved into
RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34. A third
possibility for second level controller 14 is to receive
communication messages 186 from its partner second level
controller. It will report back (188) and remain in
BOTH STATE 32. Finally, second level controller 14
may receive an instruction 190 telling it to release both
primary and secondary sets of drives. At this point it
sets the timer 192 and sends out a message 194 that it has
released both primary and secondary drive sets. It will
then remain in the RUN-DOWN-BOTH-TO-NONE
STATE 42 until it receives further instructions from the
other second level controller.
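
The paths out of BOTH STATE 32 follow a common pattern: every release request starts a timer and announces the run-down, while plain communication changes nothing. A hedged sketch of that pattern follows; the names BOTH_TRANSITIONS and from_both and the message labels are hypothetical, and the boolean simply records whether the text says a timer is set on that path.

```python
# Hypothetical message labels mapped to (next state, timer started?).
BOTH_TRANSITIONS = {
    "release_primary": ("RUN_DOWN_PRIMARIES_TO_SECONDARIES", True),      # 172
    "release_secondaries": ("RUN_DOWN_SECONDARIES_TO_PRIMARIES", True),  # 180
    "communication": ("BOTH", False),                                    # 186
    "release_both": ("RUN_DOWN_BOTH_TO_NONE", True),                     # 190
}


def from_both(message):
    """Return (next_state, timer_started) for a message in BOTH STATE 32."""
    if message not in BOTH_TRANSITIONS:
        raise ValueError(f"message not accepted in BOTH state: {message}")
    return BOTH_TRANSITIONS[message]
```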

FIG. 8G shows the various paths by which second
level controller 14 moves from
RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE 38 to one of
either NONE STATE 28 or SECONDARY STATE
36. The first possibility is that second level controller 14
receives messages 196 from the other second level
controller. It then reports back (198) and remains in
RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE
38. A second possibility is that the timer (174), set
between BOTH STATE 32 and
RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE 38, times out
(200). At this point, second level controller 14 realizes
that message 176 (FIG. 8F) was not properly sent. A
complete failure has occurred. The second level
controller reports (202) that it has released both sets of disk
drives, and releases both sets (204). Second level
controller 14 then enters NONE STATE 28. Finally, a run
down path response message 206 is received acknowledging
receipt of message 176 (FIG. 8F) sent between
BOTH STATE 32 and
RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE 38. Second level
controller 14 releases its primary drives 208 and enters
SECONDARY STATE 36.

FIG. 8H shows the possible paths down which second
level controller 14 moves between
RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34 and
one of either NONE STATE 28 or PRIMARY
STATE 26. A first possibility is that second level
controller 14 receives a message 210 from the other second
level controller. It then reports back (212) and remains
in RUN-DOWN-SECONDARIES-TO-PRIMARIES
STATE 34. A second possibility is that the timer (182),
set between BOTH STATE 32 and
RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34, times
out (214). If this occurs, second level controller 14
realizes that message 184 (FIG. 8F) was not properly sent.
A complete failure has occurred. Second level controller
then sends a message 216 indicating that it has
released its drives and then it releases both primary and
secondary disk drive sets (218) which it controls. Second
level controller then moves into NONE STATE
28. Finally, a third possibility is that second level
controller 14 receives a response 220 to message 184 (FIG.
8F) sent between BOTH STATE 32 and
RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE
34. It will then release (222) its secondary drives and
enter PRIMARY STATE 26.

FIG. 8I shows the possible paths illustrating the transition
of second level controller between RUN-
DOWN-BOTH-TO-NONE STATE 42 and NONE 
STATE 28. Three possible events may take place. First, 
a message 230 may be received from the other second 
level controller providing communication information. 
In this case, second level controller 14 reports back 
messages 232 and remains in RUN-DOWN-BOTH-TO- 
NONE STATE 42. The second event which may occur 
is for the timer (192), set during transition from BOTH
STATE 32 to RUN-DOWN-BOTH-TO-NONE STATE 42, to time
out. If this happens, second level controller 14 realizes
that message 194 (FIG. 8F) sent during BOTH STATE 32
didn't get properly sent and that there has been a complete
failure. It releases control of both its primaries and
secondary disk drives (238). It then ends up in NONE STATE
28. The third event which may occur while in
RUN-DOWN-BOTH-TO-NONE STATE 42 is for a response to be
received (240) from message 194 (FIG. 8F) sent out while second
level controller moved from BOTH STATE 32 to
RUN-DOWN-BOTH-TO-NONE STATE 42. This response indicates that
the message was properly received. Second level controller 14
then releases both sets of drives (242) and ends up in NONE
STATE 28.

Rerouting Data Paths Between Buffers and Disk Drives

FIG. 9 illustrates a first preferred embodiment of
circuitry for rerouting data paths between buffers and
disk drives 20. In FIG. 9, X-bar switches 310 through

very high input/output rate of data as well as a high
data throughput.

To illustrate this embodiment's mode of operation,
the following example is offered. Referring to FIG. 9,
assume that all data flow is initially direct, meaning, for
example, that data in buffer 330 flows directly through
X-bar switch 310 to disk drive 20A1. Were buffer 330 to
fail, the registers of X-bar switch 310 could be
reconfigured, enabling X-bar switch 310 to read data from
buffer 335 and direct that data to disk drive 20A1. Similar
failures in other buffers and in the disk drives could
be compensated for in the same manner.

Generation of Redundancy Terms and Error Detection
on Parallel Data

FIG. 10 illustrates a second preferred embodiment of
the present invention. This second embodiment
incorporates Array Correction Circuits ("ACCs") to provide
error detection and correction capabilities within the



preferred embodiment shown in FIG. 9. To ease the 
understanding of this embodiment, the full details of the 
internal structure of both the X-bar switches (310 
through 315) and the ACC circuits 360 and 370 are not 
25 shown in FIG. 10. FIGS. 11 and 12 illustrate the inter- 
nal structure of these devices and will be referenced and 
discussed in turn. Additionally, bus LBE as illustrated 
in FIG. 10 does not actually couple the second level 
controller (FIGS. 3 and 4) directly to the X-bar 
30 switches, the ACCs. and the DSI units. Instead, the 
second level controller communicates with various sets 
of registers assigned to the X-bar switches, the ACCs 
and the DSI units. These registers are loaded by the 
second level controller with the configuration data 
which establishes the operating modes of the aforemen- 
tioned components. As such registers are known, and 
their operation incidental to the present invention, they 
are not illustrated or discussed further herein. 
The embodiment shown in FIG. 10 shows data disk 
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315 are coupled to a bus 309 communicating with the 40 hHvm mai fftim»«fi maa r> - j X J , "I 
second level controller engine (see FIOS 3 and dv In 40 ? nv ^20Al through 20A4 and P and Q redundancy 



second level controller engine (see FIGS. 3 and 4). In 
turn, each X-bar switch is coupled by a bus to disk 
drives 20A1 through 20A6 and to each buffer 330 
through 336. Bus 350 couples each buffer to a first level 



term drives 20A5 and 20A6. A preferred embodiment 
of the present invention utilizes 13 disk drives: ten for 
data, two for P and Q redundancy terms, and one spare 

controller which are coupled to a computer such as 4S fi£ * ^!L* nndeR, ° < ? d t . hat «»« 

, n /iru~c » ~~aa\ 1 .u' v j- , 45 "umber of drives, and their exact utilization may vary 

computer 10 FIGS. 3 and 4). In this embodiment, al- withoiIt in any changing the present invention 

though only six disk drives are .Ilustra.ed. any arbitrary ^ disk driv * is co y upJed &, g btl ££K Si 

number could be used, as long as the illustrated archi- computer Standard Interfax) to^M S£rtb345 

lecture is preserved by increaang the number or X-bar herein labelled DSI. The DSI units perform somf error 

swuches and output buffers m a like manner and main- » detecting functions as well as buffering ™u fl"l1n" 

and out of the disk drives. 



taining the interconnected bus structures illustrated in 
FIG. 9. 

In operation, the second level controller will load 
various registers (not illustrated herein) which config 



Each DSI unit is in turn coupled by a bi-directional 
bus means to an X-bar switch, the X-bar switches herein 
numbered 310 through 315. The X-bar switches are 
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buffers and particular disk drives. The particular config- 
uration can be changed at any time while the system is 
operating. Data flow is bi-directional over all the buses. 
By configuring the X-bar switches, data flowing from 



• "~ — — — — •••« WQIJ \JJ 

means of a bi-directional bus. The bus width in this 
embodiment is 9 bits, 8 for data, 1 for a parity bit. The 
word assemblers assemble 36-bit (32 data and 4 parity) 
words for transmission to buffers 330 through 335 over 
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vice versa. Failure of any particular system element 
does not result in any significant performance degrada- 
tion, as data flow can be routed around the failed ele- 
ment by reconfiguring the registers for the X-bar 
switch. In a preferred mode of operation, data may be 
transferred from or to a particular disk drive in parallel 
with other data transfers occurring in parallel on every 
other disk drive. This mode of operation allows for a 



flows from the output buffers to the X-bar switches, the 
word assemblers decompose the 36-bit words into the 
9-bits of data and parity. 
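
The word assembly step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text gives the widths (9-bit bus words, 36-bit buffer words with 32 data and 4 parity bits) but not the bit ordering or the parity sense, so the even-parity convention, the placement of the parity bit in the least significant position, and the function names `to_nine_bit`, `assemble` and `disassemble` are all hypothetical.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit for an 8-bit value (parity sense is an assumption)."""
    return bin(byte & 0xFF).count("1") & 1


def to_nine_bit(byte: int) -> int:
    """Pack 8 data bits plus 1 parity bit into a 9-bit bus word.

    Assumed layout: data in bits 8..1, parity in bit 0.
    """
    return ((byte & 0xFF) << 1) | parity_bit(byte)


def assemble(words9: list[int]) -> int:
    """Combine four 9-bit words into one 36-bit word (first word highest)."""
    if len(words9) != 4:
        raise ValueError("expected four 9-bit words")
    word36 = 0
    for w in words9:
        word36 = (word36 << 9) | (w & 0x1FF)
    return word36


def disassemble(word36: int) -> list[int]:
    """Split a 36-bit word back into its four 9-bit words."""
    return [(word36 >> shift) & 0x1FF for shift in (27, 18, 9, 0)]


if __name__ == "__main__":
    data = [0x12, 0xA5, 0xFF, 0x00]
    bus_words = [to_nine_bit(b) for b in data]
    word = assemble(bus_words)
    # Round trip: buffer-side word back to the 9-bit bus words.
    print(disassemble(word) == bus_words)
```

Keeping the parity bit next to its data byte means each 9-bit slice can be checked on its own after disassembly; a layout that groups the four parity bits together would serve equally well.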
The X-bar switches are also coupled to ACC units 348 and 349.
The interconnection between the X-bar switches and the ACCs
is shown in more detail in FIG. 11. Each X-bar switch can
send to both or either ACC the 8 bits of data and 1 parity
bit that the X-bar switch receives from either the DSI units
or the word assemblers. In turn, the X-bar switches can
receive 9 bits of the P and Q redundancy terms calculated by
the ACCs over lines E1 and E2. As shown, the ACCs can direct
the P and Q redundancy terms to any X-bar switch, not being
limited to the disk drives labelled P and Q. Depending on the
configuration commanded by the second level controller, ACCs
348 and 349 can be mutually redundant, in which case the
failure of one or the other ACC does not affect the system's
ability to detect or correct errors, or each ACC can detect
and correct errors on a portion of the total set of disk
drives. When operating in this second manner, certain
specific types of operations which write data to individual
disk drives are expedited, as each ACC can write to a
separate individual disk drive. The specific disk drives that
the individual ACCs monitor can be reconfigured at any time
by the second level controller.

The illustrated connections of the ACCs and the X-bar
switches also allow data to be switched from any X-bar switch
to any ACC once the second level controller configures the
related registers. This flexibility allows data to be routed
away from any failed disk drive or buffer.

FIG. 11 shows important internal details of the ACCs and the
X-bar switches. X-bar switch 310 is composed of two
mirror-image sections. These sections comprise, respectively,
9-bit tri-state registers 370/380, multiplexers 372/382,
first 9-bit registers 374/384, second 9-bit registers
376/386, and input/output interfaces 379/389. In operation,
data can flow either from the word assembler to the DSI unit
or vice versa.

Although many pathways through the X-bar switch are possible,
as shown by FIG. 11, two aspects of these pathways are of
particular importance. First, in order to allow the ACC
sufficient time to calculate P and Q redundancy terms or to
detect and correct errors, a data pathway of several
registers can be used, the data requiring one clock cycle to
move from one register to the next. By clocking the data
through several registers, a delay of sufficient length can
be achieved. For example, assuming a data flow from the word
assembler unit to a disk drive, 9 bits are clocked into 9-bit
register 374 and tri-state register 370 on the first clock
pulse. On the next clock pulse, the data moves to 9-bit
register 386 and through redundancy circuit 302 in the ACC
348 to P/Q registers 304 and 306. The next clock pulses move
the data to the DSI unit.

The second important aspect of the internal pathways relates
to the two tri-state registers. The tri-state registers are
not allowed to be active simultaneously. In other words, if
either tri-state register 370 or 380 is enabled, its
counterpart is disabled. This controls data transmission from
the X-bar switch to the ACC. The data may flow only from the
DSI unit to the ACC or from the word assembler to the ACC,
but not from both to the ACC simultaneously. In the opposite
direction, data may flow from the ACC to the word assembler
and the DSI simultaneously.

ACC unit 348 comprises a redundancy circuit 302, wherein P
and Q redundancy terms are generated, P and Q registers 304
and 306, wherein the P and Q redundancy terms are stored
temporarily, regenerator and corrector circuit 308, wherein
the data from or to a failed disk drive or buffer can be
regenerated or corrected, and output interfaces 390, 391, 392
and 393.

Redundancy Generation and Error Checking Equations

The main functional components of the second preferred
embodiment and their physical connections to one another have
now been described. The various preferred modes of operation
will now be described. In order to understand these
functional modes, some understanding of the error detection
and correction method used by the present invention will be
necessary.

Various error detection and correction codes are known and
used in the computer industry. Error-Control Coding and
Applications, D. Wiggert, The MITRE Corp., describes various
such codes and their calculation. The present invention in
this second preferred embodiment is implemented using a
Reed-Solomon error detection and correction code. Nothing
herein should be taken to limit the present invention to
using only a Reed-Solomon code. If other codes were used,
various modifications to the ACCs would be necessary, but
these modifications would in no way change the essential
features of this invention.

Reed-Solomon codes are generated by means of a field
generator polynomial, the one used in this embodiment being
$X^4 + X + 1$. The code generator polynomial needed for this
Reed-Solomon code is
$(X + a^0)(X + a^1) = X^2 + a^4 X + a^1$. The generation and
use of these codes to detect and correct errors is known.

The actual implementation of the Reed-Solomon code in the
present invention requires the generation of various terms
and syndromes. For purposes of clarity, these terms are
generally referred to herein as the P and Q redundancy terms.
The equations which generate the P and Q redundancy terms
are:

$$P = d_{n-1} + d_{n-2} + \cdots + d_1 + d_0$$

and

$$Q = d_{n-1} a_{n-1} + d_{n-2} a_{n-2} + \cdots + d_1 a_1 + d_0 a_0$$

The P redundancy term is essentially the simple parity of all
the data bytes enabled in the given calculation. The Q logic
calculates the Q redundancy for all data bytes that are
enabled. For Q redundancy, input data must first be
multiplied by a constant "a" before it is summed. The logic
operations necessary to produce the P and Q redundancy terms
are shown in FIGS. 12a and 12b. All operations denoted by ⊕
are exclusive-OR ("XOR") operations. Essentially, the final P
term is the sum of all P_i terms. The Q term is derived by
multiplying all Q_i terms by a constant and then XORing the
results. These calculations occur in redundancy circuit 302
in ACC 260 (FIG. 11). The second preferred embodiment, using
its implementation of the Reed-Solomon code, is able to
correct the data on up to two failed disk drives.

The correction of data requires the generation of additional
terms S0 and S1 within the ACC. Assuming that the P and Q
redundancy terms have already been calculated for a group of
data bytes, the syndrome equations

$$S_0 = d_{n-1} + d_{n-2} + \cdots + d_1 + d_0 + P$$

$$S_1 = (d_{n-1} a_{n-1}) + (d_{n-2} a_{n-2}) + \cdots + (d_1 a_1) + (d_0 a_0) + Q$$

are used to calculate S0 and S1. For S0 an ACC register
enables the necessary data bytes and the P redundancy to be
used in the calculation. For S1, the necessary input data
must first be multiplied by a_i before being summed with the
Q redundancy information.

As stated, an ACC can correct the data on up to two failed
disk drives in this embodiment. The failed disk drive
register (not illustrated) in the relevant ACC will be loaded
with the address of the failed disk or disks by the second
level controller. A constant circuit within the ACC will use
the drive location information to calculate two constants k0
and k1 as indicated in Table 1 below, where i represents the
address of the first failed disk drive, j is the address of
the second failed disk drive, and a is a constant. The
columns labelled Failed Drives indicate which drives have
failed. Columns k0 and k1 indicate how those constants are
calculated given the failure of the drives noted in the
Failed Drives columns.

TABLE 1

The error correction circuits use the syndrome information S0
and S1, as well as the two constants k0 and k1, to generate
the data contained on the failed disk drives. The error
correction equations are as follows:

$$F_1 = S_0 k_0 + S_1 k_1$$

$$F_2 = S_0 + F_1$$

F1 is the replacement data for the first failed disk drive.
F2 is the replacement data for the second failed disk drive.
The equations which generate the P and Q redundancy terms are
realized in combinatorial logic, as is partially shown in
FIGS. 12a and 12b. This has the advantage of allowing the
redundancy terms to be generated and written to the disk
drives at the same time that the data is written to the
drives. This mode of operation will be discussed later.

Operational Modes

Having described the aspects of the Reed-Solomon code
implementation necessary to understand the present invention,
the operational modes of the present invention will now be
discussed.

The second preferred embodiment of the present invention
operates primarily in one of two classes of operations. These
are parallel data storage operations and transaction
processing operations. These two classes of operations will
now be discussed with reference to the figures, particularly
FIGS. 10, 13 and 14 and Tables 2 through 7.

Although FIG. 10 only shows four data drives and the P and Q
redundancy term drives, a preferred embodiment uses a set of
13 disk drives, 10 for data, 2 for the P and Q terms, and a
spare. Although nothing herein should be construed to limit
this discussion to that specific embodiment, parallel
processing operations will be described with relation to that
environment.

Parallel Processing Operations

In parallel processing operations, all the drives are
considered to comprise a single large set. Each of the disk
drives will either receive or transmit 9 bits of data
simultaneously. The result of this is that the 9 bits of data
appearing in the DSI units of all the drives simultaneously
are treated as one large codeword. This result is shown in
FIG. 13a. Codeword 400 comprises 9 bits of data from or for
disk drive d_{n-1}, 9 bits of data from or for disk drive
d_{n-2}, and so on, with the P and Q disk drives receiving or
transmitting the P and Q redundancy term. In a parallel write
operation, all the disk drives in the set, except for the
spare disk drive, will receive a byte of data (or a
redundancy term whose length is equal to the data byte)
simultaneously. As shown, the same sector in all the disk
drives will receive a part of codeword 400. For example, in
the illustration, sector 1 of disk drive n-1 will receive a
byte of data designated d_{n-1} from codeword 400, sector 1
of disk drive n-2 will receive a byte of data designated
d_{n-2} from codeword 400 and so on.

In the actual implementation of this preferred embodiment,
the codewords are "striped" across the various disk drives.
This means that for each successive codeword, different disk
drives receive the P and Q redundancy terms. In other words,
drive d_{n-1} is treated as drive d_{n-2} for the second
codeword and so on, until what was originally drive d_{n-1}
receives a Q redundancy term. Thus, the redundancy terms
"stripe" through the disk drives.

Pairs of P and Q Terms for Nibbles

Calculating the P and Q redundancy terms using 8-bit symbols
would require a great deal of hardware. To reduce this
hardware overhead, the calculations are performed using 4-bit
bytes or nibbles. This hardware implementation does not
change the invention conceptually, but does result in the
disk drives receiving two 4-bit data nibbles combined to make
one 8-bit byte. The illustrated sectors of the drives show
how the codeword is broken up and how the disk drives receive
upper and lower 4-bit nibbles. Table 2 shows how, for
successive codewords, a different portion of the codeword is
placed on the different drives. Each disk drive, for a given
codeword, receives an upper and lower 4-bit nibble,
designated with L's and U's, of the codeword. Additionally,
the same sector is used to store the nibbles on each of the
disk drives used to store the codeword. In other words, for
codeword 1, the first sector of disk drives n-1 through 0
receives the nibbles.
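
The redundancy, syndrome and correction equations of the preceding sections can be exercised on 4-bit symbols with a short Python sketch. It is an illustration, not the ACC logic itself. The field polynomial X^4 + X + 1 comes from the text; the choice a_i = a^i (with a the field element 2), the constants k0 and k1 derived here directly from the syndrome equations (the body of Table 1 is not reproduced in this text), and all function names are assumptions made for the example.

```python
POLY = 0b10011  # field generator polynomial X^4 + X + 1 from the text


def gf_mul(a: int, b: int) -> int:
    """Multiply two 4-bit symbols in GF(16)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:  # reduce modulo the field polynomial
            a ^= POLY
    return result


def gf_pow(a: int, n: int) -> int:
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def gf_inv(a: int) -> int:
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(16)")
    return gf_pow(a, 14)  # a^15 == 1 for every non-zero element


def coeff(i: int) -> int:
    """Assumed coefficient a_i = a^i for data drive i (not stated in the text)."""
    return gf_pow(0b0010, i)


def redundancy(data: list[int]) -> tuple[int, int]:
    """P is the XOR of all symbols; Q is the XOR of each symbol times a_i."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(d, coeff(i))
    return p, q


def syndromes(data: list[int], p: int, q: int) -> tuple[int, int]:
    """S0 and S1; both are zero when data, P and Q are consistent."""
    s0, s1 = p, q
    for i, d in enumerate(data):
        s0 ^= d
        s1 ^= gf_mul(d, coeff(i))
    return s0, s1


def recover_two(data: list[int], p: int, q: int, i: int, j: int) -> tuple[int, int]:
    """Rebuild data drives i and j; their slots in `data` must hold 0."""
    s0, s1 = syndromes(data, p, q)
    denom_inv = gf_inv(coeff(i) ^ coeff(j))
    k0 = gf_mul(coeff(j), denom_inv)  # derived constant, assumed form
    k1 = denom_inv
    f1 = gf_mul(s0, k0) ^ gf_mul(s1, k1)  # F1 = S0*k0 + S1*k1
    f2 = s0 ^ f1                          # F2 = S0 + F1
    return f1, f2


if __name__ == "__main__":
    nibbles = [3, 7, 0, 15, 9, 1, 12, 4, 8, 5]  # ten data drives
    p, q = redundancy(nibbles)
    damaged = list(nibbles)
    damaged[1] = damaged[7] = 0  # drives 1 and 7 have failed
    print(recover_two(damaged, p, q, 1, 7) == (nibbles[1], nibbles[7]))
```

Because addition in this field is XOR, the same `syndromes` routine serves both for checking a parallel read (both results zero) and for producing the inputs to the two-drive correction.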

TABLE 2

CODEWORD - DATA AND P AND Q

             Sector of               Sector of               Sector of           Sector of           Sector of
             Drive d_{n-1}           Drive d_{n-2}           Drive d_0           Drive P             Drive Q

Codeword 1   (d_{n-1,L})(d_{n-1,U})  (d_{n-2,L})(d_{n-2,U})  (d_{0,L})(d_{0,U})  (P_{1,L})(P_{1,U})  (Q_{1,L})(Q_{1,U})
Codeword 2   (d_{n-1,L})(d_{n-1,U})  (d_{n-2,L})(d_{n-2,U})  (d_{0,L})(d_{0,U})  (P_{2,L})(P_{2,U})  (Q_{2,L})(Q_{2,U})
Codeword n   (d_{n-1,L})(d_{n-1,U})  (d_{n-2,L})(d_{n-2,U})  (d_{0,L})(d_{0,U})  (P_{n,L})(P_{n,U})  (Q_{n,L})(Q_{n,U})
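
One row of Table 2 can be sketched as follows: each data byte is split into a lower and an upper nibble, and a separate P and Q pair is computed over the lower nibbles and over the upper nibbles. This is a sketch under assumptions, not the hardware layout: the coefficient choice a_i = a^i, the function name `table2_row`, and the convention that `data_bytes[i]` belongs to drive d_i are all hypothetical.

```python
POLY = 0b10011  # X^4 + X + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two 4-bit symbols in GF(16)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= POLY
    return result


def coeff(i: int) -> int:
    """Assumed coefficient a_i = a^i, with a = 2."""
    result = 1
    for _ in range(i):
        result = gf_mul(result, 2)
    return result


def redundancy(nibbles: list[int]) -> tuple[int, int]:
    """Return (P, Q) for a list of 4-bit symbols indexed by drive number."""
    p = q = 0
    for i, d in enumerate(nibbles):
        p ^= d
        q ^= gf_mul(d, coeff(i))
    return p, q


def table2_row(data_bytes: list[int]) -> list[tuple[int, int]]:
    """Build one codeword row as (lower, upper) nibble pairs.

    data_bytes[i] is the byte for data drive d_i. The result is ordered
    as in Table 2: d_{n-1} ... d_0, then the P pair, then the Q pair.
    """
    lower = [b & 0x0F for b in data_bytes]
    upper = [(b >> 4) & 0x0F for b in data_bytes]
    p_l, q_l = redundancy(lower)
    p_u, q_u = redundancy(upper)
    cells = [(lower[i], upper[i]) for i in reversed(range(len(data_bytes)))]
    return cells + [(p_l, p_u), (q_l, q_u)]


if __name__ == "__main__":
    for cell in table2_row([0x12, 0xAB]):
        print(cell)
```

Treating the two nibble streams as independent codewords is what lets a 4-bit field protect 8-bit bytes: a failed drive loses one symbol from each stream, and each stream is corrected separately.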



Referring back to FIG. 10, for a parallel data write to the
disks, the data is provided in parallel from buffers 330,
331, 332 and 333 along those data buses coupling the buffers
to X-bar switches 310, 311, 312, and 313 after the 36 bits of
data are disassembled in word assemblers 350 through 353 into
9-bit words. These X-bar switches are also coupled to inputs
D3, D2, D1 and D0, respectively, of ACC 348 and ACC 349. In
parallel processing modes, the two ACCs act as mutual
"backups" to one another. Should one fail, the other will
still perform the necessary error correcting functions. In
addition to operating in a purely "backup" condition, the
second level controller engine can configure the ACCs so that
each ACC is performing the error detection and correction
functions for a portion of the set, the other ACC performing
these functions for the remaining disk drives in the set. As
the ACC units are still coupled to all the disk drives,
failure of one or the other unit does not impact the system
as the operating ACC can be reconfigured to act as the
dedicated ACC unit for the entire set. For purposes of
discussion, it is assumed here that ACC 348 is operating. ACC
348 will calculate the P and Q redundancy term for the data
in the X-bar switches and provide the terms to its E1 and E2
outputs, which outputs are coupled to all the X-bar switches.
For discussion only, it is assumed that only the E2
connection of X-bar switch 314 and the E1 connection of X-bar
switch 315 are enabled. Thus, although the data is provided
along the buses coupling ACC 348's E1 and E2 output to all
the X-bar switches, the Q term is received only by X-bar
switch 314 and the P term is received by X-bar switch 315.
Then, the Q and P terms are provided first to DSI units 344
and 345 and then disk drives 20A5 and 20A6. It should be
recalled that the various internal registers in the X-bar
switches will act as a multi-stage pipeline, effectively
slowing the transit of data through the switches sufficiently
to allow ACC 348's redundancy circuit 302 to calculate the P
and Q redundancy terms.

As ACC 349 is coupled to the X-bar switches in a
substantially identical manner to ACC 348, the operation of
the system when ACC 349 is operational is essentially
identical to that described for ACC 348.

Subsequent parallel reads from the disks occur in the
following manner. Data is provided on bi-directional buses to
DSI units 340, 341, 342 and 343. P and Q redundancy terms are
provided by DSI units 345 and 344, respectively. As the data
and P and Q terms are being transferred through X-bar
switches 310 through 315, ACC 348 uses the P and Q terms to
determine if the data being received from the disk drives is
correct. Word assemblers 350 through 353 assemble successive
9-bit words until the next 36 bits are available. These 36
bits are forwarded to buffers 330 through 333. Note that the
9-bit words are transmitted to the buffers in parallel. If
that data is incorrect, the second level controller will be
informed.

During a parallel read operation, in the event that there is
a failure of a disk drive, the failed disk drive will, in
certain instances, communicate to the second level controller
that it has failed. The disk drive will communicate with the
second level controller if the disk drive cannot correct the
error using its own corrector. The second level controller
will then communicate with ACCs 348 and 349 by loading the
failed drive registers in the ACC (not shown in the figures)
with the address of the failed drive. The failed drive can be
removed from the set by deleting its address from the
configuration registers. One of the set's spare drives can
then be used in place of the failed drive by inserting the
address of the spare drive into the configuration registers.

The ACC will then calculate the replacement data necessary to
rewrite all the information that was on the failed disk onto
the newly activated spare. In this invention, the term spare
or backup drive indicates a disk drive which ordinarily does
not receive or transmit data until another disk drive in the
system has failed.

When the data, P, and Q bytes are received, the ACC circuits
use the failed drive location in the failed drive registers
to direct the calculation of the replacement data for the
failed drive. After the calculation is complete, the data
bytes, including the recovered data, are sent to data buffers
in parallel. Up to two failed drives can be tolerated with
the Reed-Solomon code implemented herein. All operations to
replace failed disk drives and the data thereon occur when
the system is operating in a parallel mode.

Regeneration of data occurs under second level controller
control. When a failed disk drive is to be replaced, the ACC
regenerates all the data for the replacement disk. Read/write
operations are required until all the data has been replaced.
The regeneration of the disk takes a substantial amount of
time, as the process occurs in the background of the system's
operations so as to reduce the impact to normal data transfer
functions. Table 3 below shows the actions taken for
regeneration reads. In Table 3, i represents a first failed
drive and j represents a second failed drive. In Table 3 the
column labelled Failed Drives indicates the particular drives
that have failed. The last column describes the task of the
ACC given the particular indicated failure.

TABLE 3

Regeneration Read

Failed Drives

P         ACC calculates P redundancy
i         ACC calculates replacement data for i drive
i    P    ACC calculates replacement data for i drive
          and P redundancy
Q    i    ACC calculates replacement data for i drive
          and Q redundancy
j    i    ACC calculates replacement data for i and j drives
P    Q    ACC calculates P and Q redundancy
It should be noted that if both a data disk drive and a
redundancy disk drive fail, the data on the data disk drive
must be regenerated before the redundancy terms on the
redundancy drive. During a regeneration write, regeneration
data or redundancy terms are written to a disk and no action
is required from the ACC logic.

During a parallel read operation, it should also be noted
that additional error detection may be provided by the ACC
circuitry.

Table 4 indicates what actions may be taken by the ACC logic
unit when the indicated drive(s) has or have failed during a
failed drive read operation. In this operation, the drives
indicated in the Failed Drives columns are known to have
failed prior to the read operation. The last column indicates
the ACC response to the given failure.

TABLE 4

Failed Drives

P         No action by ACC
Q         No action by ACC
i         ACC calculates replacement data
i    P    ACC calculates the replacement data
Q    i    ACC calculates the replacement data
j    i    ACC calculates replacement data
P    Q    No action by ACC

Transaction Processing Mode: Read

Transaction processing applications require the ability to
access each disk drive independently. Although each disk
drive is independent, the ACC codeword with P and Q
redundancy is maintained across the set in the previously
described manner. For a normal read operation, the ACC
circuitry is not generally needed. If only a single drive is
read, the ACC cannot do its calculations since it needs the
data from the other drives to assemble the entire codeword to
recalculate P and Q and compare it to the stored P and Q.
Thus, the data is assumed to be valid and is read without
using the ACC circuitry (see FIG. 15). Where drive 20C1 is
the one selected, the data is simply passed through DSI unit
342, X-bar switch 312, word assembler 352 and buffer 332 to
the external computer. If the disk drive has failed, the read
operation is the same as a failed drive read in parallel mode
with the exception that only the replacement data generated
by the ACC is sent to the data buffer. In this case, the disk
drive must notify the second level controller that it has
failed, or the second level controller must otherwise detect
the failure. Otherwise, the second level controller will not
know that it should read all the drives, unless it assumes
that there might be an error in the data read from the
desired drive. The failed drive read is illustrated in FIG.
16, with drive 20C1 having the desired data, as in the
example of FIG. 15. In FIG. 16, the second level controller
knows that drive 20C1 has failed, so the second level
controller calls for a read of all drives except the failed
drive, with the drive 20C1 data being reconstructed from the
data on the other drives and the P and Q terms. Only the
reconstructed data is provided to its buffer, buffer 332,
since this is the only data the external computer needs.

Transaction Processing Mode: Write

When any individual drive is written to, the P and Q
redundancy terms must also be changed to reflect the new data
(see FIG. 18). This is because the data being written over
was part of a code word extending over multiple disk drives
and having P and Q terms on two disk drives. The previously
stored P and Q terms will no longer be valid when part of the
codeword is changed, so new P and Q terms, P" and Q", must be
calculated and written over the old P and Q terms on their
respective disk drives. P" and Q" will then be proper
redundancy terms for the modified code word.

One possible way to calculate P" and Q" is to read out the
whole codeword and store it in the buffers. The new portion
of the codeword for drive 20C1 can then be supplied to the
ACC circuit along with the rest of the codeword, and the new
P" and Q" can be calculated and stored on their disk drives
as for a normal parallel write. However, if this method is
used, it is not possible to simultaneously do another
transaction mode access of a separate disk drive (i.e., drive
20A1) having part of the codeword, since that drive (20A1)
and its buffer are needed for the transaction mode write for
the first drive (20C1).

According to a method of the present invention, two
simultaneous transaction mode accesses are made possible by
using only the old data to be written over and the old P and
Q to calculate the new P" and Q" for the new data. This is
done by calculating an intermediate P' and Q' from the old
data and old P and Q, and then using P' and Q' with the new
data to calculate the new P" and Q". This requires a
read-modify-write operation on the P and Q drives. The
equations for the new P and Q redundancy are:

$$P'' = (\text{old } P - \text{old data}) + \text{new data}$$

$$Q'' = (\text{old } Q - \text{old data} \cdot a_i) + \text{new data} \cdot a_i$$

$$P' = \text{old } P - \text{old data}$$

$$Q' = \text{old } Q - \text{old data} \cdot a_i$$

where a_i is the coefficient from the syndrome equation S1,
and i is the index of the drive.

During the read portion of the read-modify-write, the data
from the drive to be written to and the P and Q drives are
summed by the ACC logic, as illustrated in FIG. 17. This
summing operation produces the P' and Q' data. The prime data
is sent to a data buffer. When the new data is in a data
buffer, the write portion of the cycle begins as illustrated
in FIG. 18. During this portion of the cycle, the new data
and the P' and Q' data are summed by the ACC logic to
generate the new P" and Q" redundancy. When the summing
operation is complete, the new data is sent to the disk drive
and the redundancy information is sent to the P and Q drives.
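
The read-modify-write update can be checked against a full recalculation with a small Python sketch. It works on 4-bit symbols in the field defined by X^4 + X + 1, where both the "+" and the "-" of the equations above reduce to XOR. The coefficient choice a_i = a^i and the function names are assumptions for illustration only.

```python
POLY = 0b10011  # X^4 + X + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two 4-bit symbols in GF(16)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= POLY
    return result


def coeff(i: int) -> int:
    """Assumed coefficient a_i = a^i, with a = 2."""
    result = 1
    for _ in range(i):
        result = gf_mul(result, 2)
    return result


def full_redundancy(data: list[int]) -> tuple[int, int]:
    """Reference calculation of (P, Q) over the whole codeword."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(d, coeff(i))
    return p, q


def read_modify_write(old_data: int, new_data: int,
                      old_p: int, old_q: int, i: int) -> tuple[int, int]:
    """Return (P'', Q'') using only drive i's old data and the old P and Q."""
    a_i = coeff(i)
    # Read portion: remove the old data's contribution (P' and Q').
    p_prime = old_p ^ old_data
    q_prime = old_q ^ gf_mul(old_data, a_i)
    # Write portion: add the new data's contribution (P'' and Q'').
    return p_prime ^ new_data, q_prime ^ gf_mul(new_data, a_i)


if __name__ == "__main__":
    codeword = [3, 7, 0, 15, 9, 1, 12, 4, 8, 5]
    p, q = full_redundancy(codeword)
    updated = list(codeword)
    updated[4] = 0xE  # overwrite drive 4 only
    print(read_modify_write(codeword[4], 0xE, p, q, 4) == full_redundancy(updated))
```

Only three drives are touched (the target drive and the P and Q drives), which is what leaves the remaining data drives free for a second simultaneous transaction.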

Parity Check of P and Q for Transaction Mode Write

During these read-modify-write operations, it is also
possible that the ACC unit itself may fail. In this case, if
the data in a single element were to be changed by a
read-modify-write operation, a hardware failure in the ACC
might result in the redundancy bytes for the new data being
calculated erroneously. To prevent this occurrence, the
parity detector and parity generator are made part of the ACC
circuitry. This additional redundant circuit is shown in
FIGS. 14a and 14b and resides within redundancy circuit 302
as shown in FIG. 11. When data is received by the ACC
circuitry, parity is checked to ensure that no errors have
occurred using the P and Q redundancy terms. In calculating
Q", new parity is generated for the product of the multiply
operation and is summed with the parity of the old Q" term.
This creates the parity for the new Q term. For the P byte,
the parity bits from the data are summed with the parity bit
of the old P term to create the new parity bit for the new P"
term. Before writing the new data back to the disk drive, the
parity of Q' (calculated as indicated previously) is checked.
Should Q' be incorrect, the second level controller engine
will be informed of an ACC failure. In this manner, a failure
in the ACC can be detected.

The same operations are performed for a failed disk drive
write in transaction processing operations as for parallel
data writes, except that data is not written to a failed
drive or drives.

With respect to transaction processing functions during
normal read operations, no action is required from the ACC
logic. The actions taken by the ACC logic during a failed
drive read in transaction processing mode are listed in Table
5 below, where i and j represent the first and second failed
drives. The columns labelled Failed Drives indicate which
drives have failed. The last column indicates what action the
ACC may or may not take in response to the indicated failure.

TABLE 5

Failed Drives

P         Redundancy drives are not read; no ACC action
Q         Redundancy drives are not read; no ACC action
i         ACC logic calculates replacement data and
          performs a parallel read
i    P    ACC logic calculates replacement data and
          performs a parallel read
Q    i    ACC logic calculates replacement data and
          performs a parallel read
j    i    ACC logic calculates replacement data and
          performs a parallel read
P    Q    No ACC action as only data disk drives are read

TABLE 6

Failed Drives

          All good data disk drives are read into data
          buffers
          All good data disk drives are read into data
          buffers
j    i    Perform a parallel read, the ACC logic
          calculates the replacement data for the jth
          failed drive. Next, the remaining good data
          disk drives are read into the data buffers.
P    Q    No read before write operation is necessary

When a failed data disk drive is to be written, all good data
disk drives must be read so that a new P and Q redundancy can
be generated. All of the data from the good data disk drives
and the write data is summed to generate the new redundancy.
When two data disk drives fail, the ACC logic must calculate
replacement data for both failed drives. If only one drive is
to be read, both must be reported to the ACC logic.

During write operations, the ACC continues to calculate P and
Q redundancy. Table 7 shows the ACC's tasks during failed
drive writes. Here P and Q represent the P and Q redundancy
term disk drives, and i and j represent the first and second
failed data disk drives. The columns Failed Drives denote the
particular failed drives, and the last column indicates the
ACC response to the failed drives.

TABLE 7

Failed Drives

P         ACC calculates Q redundancy only
Q         ACC calculates P redundancy only
i         ACC calculates P and Q redundancy
i    P    ACC calculates Q redundancy only
Q    i    ACC calculates P redundancy only
j    i    ACC calculates P and Q redundancy
P    Q    ACC logic takes no action



40 



Summary of ECC 

The interconnected arrangements herein described 
relative to both preferred embodiments of the present 
invention allow for the simultaneous transmission of 
45 data from all disks to the word assemblers or vice versa. 
Data from or to any given disk drive may be routed to 
any other word assembler through the X-bar switches 
If two data disk drives fail, the ACC logic must calcu- und ^ r second level controller engine control. Addition- 
late the needed replacement data for both disk drives. If aHyrdata in any word assembler may be routed to any 
only one failed drive is to be read, both failed drives 50 disk drive through the X-bar switches. The ACC units 

mitct cttll Kp Kv th A a fr* 1ai.; a receive fill data frnm all Y.hor cun<pfix> t_. 
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must still be noted by the ACC logic. 

In the read-before-write operation (part of the read- 
modify-write process), the ACC logic generates P' and 
Q' redundancy terms. Table 6 shows the action taken by 
the ACC logic when a failed disk drive read precedes a 
write in this process. Again, i and j represent the first 
and second failed drives. The columns headed by Failed 
Drives indicate which drives have failed, and the last 
column denotes the response of the ACC to the indi* 
cated failures. 

TABLE 6 

Failed Dnve> 

P — ACC calculate! Q' only 

0 — ACC calculates P* only 

> — ACC logic lakes no action and all good data 

disk drives are read into dala buffers 

1 P All good.data disk drives are read into data 

buffers 



receive all data from all X-bar switches simultaneously. 
Any given disk drive, if it fails, can be removed from 
the network at any time. The X-bar switches provide 
alternative pathways to route data or P and Q terms 
55 around the failed component. 
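
As an illustrative aside, the failed-drive write cases of Table 7 appear to follow a single rule: the ACC calculates whichever of the P and Q terms still has a working redundancy drive. The Python sketch below encodes that reading of the table. The function name and the string labels are hypothetical and are not part of the described hardware.

```python
def acc_write_action(failed):
    """Return the ACC task for a failed-drive write.

    `failed` is a set of drive labels as used in Table 7:
    "P" and "Q" for the redundancy drives, "i" and "j" for
    the first and second failed data drives.
    """
    # Only redundancy terms whose drive is still good are calculated.
    surviving = [term for term in ("P", "Q") if term not in failed]
    if not surviving:
        return "no action"
    return "calculate " + " and ".join(surviving)


if __name__ == "__main__":
    for case in ({"P"}, {"Q"}, {"i"}, {"i", "P"}, {"Q", "i"}, {"i", "j"}, {"P", "Q"}):
        print(sorted(case), "->", acc_write_action(case))
```

Failed data drives do not change the result by themselves; they matter only for the read side, where replacement data must be calculated.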

The parallel arrangement of disk drives and X-bar
switches creates an extremely fault-tolerant system. In
the prior art, a single bus feeds the data from several
disk drives into a single large buffer. In the present
invention, the buffers are small and one buffer is
assigned to each disk drive. The X-bar switches, under
control of the ACC units, can route data from any given
disk drive to any given buffer and vice versa. Each
second level controller has several spare disks and one
spare buffer coupled to it. The failure of any two disks
can be easily accommodated by switching the failed
disk from the configuration by means of its X-bar
switch and switching one of the spare disks onto the
network. The present invention thus uses the error
detection and correction capabilities of a Reed-Solomon
error correction code in an operational environment
where the system's full operational capabilities
can be maintained by reconfiguring the system to cope
with any detected disk or buffer failure. The ACC can
correct and regenerate the data for the failed disk drive
and, by reconfiguring the registers of the failed and
spare disk drives, effectively remove the failed drive
from the system and regenerate or reconstruct the data
from the failed disk onto the spare disk.

Disk Drive Configuration and Format

The present invention allows a set of physical mass
data storage devices to be dynamically configured as
one or more logical mass storage devices. In accordance
with the present invention, such a set of physical
devices is configurable as one or more redundancy
groups and each redundancy group is configurable as
one or more data groups.

A redundancy group, as previously used in known
device sets, is a group of physical devices all of which
share the same redundant device set. A redundant device
is a device that stores duplicated data or check data
for purposes of recovering stored data if one or more of
the physical devices of the group fails.

Where check data is involved, the designation of a
particular physical device as a redundant device for an
entire redundancy group requires that the redundant
device be accessed for all write operations involving
any of the other physical devices in the group. Therefore,
all write operations for the group interfere with
one another, even for small data accesses that involve
less than all of the data storage devices.

It is known to avoid this contention problem on write
operations by distributing check data throughout the
redundancy group, thus forming a logical redundant
device comprising portions of several or all devices of
the redundancy group. For example, FIG. 19 shows a
group of 13 disk storage devices. The columns represent
the various disks D1-D13 and the rows represent different
sectors S1-S5 on the disks. Sectors containing check
data are shown as hatched. Sector S1 of disk D13 contains
check data for sectors of disks D1-D12. Likewise,
the remaining hatched sectors contain check data for
their respective sector rows. Thus, if data is written to
sector S4 of disk D7, then updated check data is written
into sector S4 of disk D10. This is accomplished by
reading the old check data, re-coding it using the new
data, and writing the new check data to the disk. This
operation is referred to as a read-modify-write. Similarly,
if data is written to sector S1 of disk D11, then
check data is written into sector S1 of disk D13. Since
there is no overlap in this selection of four disks for
writes, both read-modify-write operations can be performed
in parallel.

A distribution of check data in a redundancy group in
the manner shown in FIG. 19 is known as a striped
check data configuration. The term "striped redundancy
group" will be used herein to refer generally to a
redundancy group in which check data is arranged in a
striped configuration as shown in FIG. 19, and the term
"redundancy group stripe depth" will be used herein to
refer to the depth of each check data stripe in such a
striped redundancy group.

In previously known device sets, it was known to
provide the whole set as a single redundancy group. It
has been found that a redundancy group can be divided
into various "extents" each defined as a portion of the
depth of the redundancy group and each capable of
having a configuration of check data different from that
of other extents in the same redundancy group. Moreover,
it has been found that more than one redundancy
group can be provided in a single device set, under the
control of a single "array controller" and connected to
a main processing unit via one or more device controllers.

Similarly, in previously known device sets, the single
redundancy group included only one data group for
[...] redundancy group can be broken up into multiple data
groups, each of which can operate as a separate logical
storage device or as part of a larger logical storage
device. A data group can include all available mass
storage memory on a single physical device (i.e., all
memory on the device available for storing application
data), or it can include all available mass storage memory
on a plurality of physical devices in the redundancy
group. Alternatively, as explained more fully below, a
data group can include several physical devices, but
instead of including all available mass storage memory
of each device might only include a portion of the available
mass storage memory of each device. In addition, it
has been found that it is possible to allow data groups
from different redundancy groups to form a single logical
device. This is accomplished, as will be more fully
described, by superimposing an additional logical layer
on the redundancy and data groups.

Moreover, in previously known device sets in which
application data is interleaved across the devices of the
set, the data organization or geometry is of a very simple
form. Such sets generally do not permit different
logical organizations of application data in the same
logical unit nor do they permit dynamic mapping of the
logical organization of application data in a logical unit.
It has been found that the organization of data within a
data group can be dynamically configured in a variety
of ways. Of particular importance, it has been found
that the data stripe depth of a data group can be made
independent of redundancy group stripe depth, and can
be varied from one data group to another within a logical
unit to provide optimal performance characteristics
for applications having different data storage needs.

An embodiment of a mass storage system 500 including
two second level controllers 14A and 14B is shown
in the block diagram of FIG. 20. As seen in FIG. 20,
each of parallel sets 501 and 502 includes thirteen physical
drives 503-515 and a second level controller 14.
Second level controller 14 includes a microprocessor
516a which controls how data is written and validated
across the drives of the parallel set. Microprocessor
516a also controls the update or regeneration of data
when one of the physical drives malfunctions or loses
synchronization with the other physical drives of the
parallel set. In accordance with the present invention,
microprocessor 516a in each second level controller 14
also controls the division of parallel sets 501 and 502
into redundancy groups, data groups and application
units. The redundancy groups, data groups and application
units can be configured initially by the system operator
when the parallel set is installed, or they can be
configured at any time before use during run-time of the
parallel set. Configuration can be accomplished, as described
in greater detail below, by defining certain configuration
parameters that are used in creating various
address maps in the program memory of microprocessor
516a and, preferably, on each physical drive of the
parallel set.
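
The read-modify-write update described in connection with FIG. 19 can be sketched as follows. This is a minimal illustration that assumes simple XOR parity as the check code (the embodiments described herein use Reed-Solomon P and Q terms), and that the old data is read along with the old check data. The function names and the tuple layout of a write are hypothetical.

```python
def updated_check(old_check: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Re-code XOR check data for one sector after a data write.

    Removing the old data and adding the new data are both XOR
    operations, so the other data disks need not be read.
    """
    return bytes(c ^ o ^ n for c, o, n in zip(old_check, old_data, new_data))


def can_run_in_parallel(write_a, write_b) -> bool:
    """Each write is a (data_disk, check_disk) pair.

    Two read-modify-write operations can overlap when they
    touch no disk in common.
    """
    return set(write_a).isdisjoint(write_b)


if __name__ == "__main__":
    # FIG. 19 example: writes to D7 (check on D10) and D11 (check on D13).
    print(can_run_in_parallel(("D7", "D10"), ("D11", "D13")))
```

Two writes whose check data falls on the same disk fail the disjointness test, which corresponds to the contention problem that striping the check data is meant to reduce.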

Each of second level controllers 14A and 14B is connected
to a pair of first level controllers 12A and 12B.
Each first level controller is in turn connected by a bus
or channel 522 to a CPU main memory. In general, each
parallel set is attached to at least two sets of controllers
so that there are at least two parallel paths from one or
more CPU main memories to that parallel set. Thus, for
example, each of the second level controllers 14A and
14B is connected to first level controllers 12A and 12B
by buses 524 and 526. Such parallel data paths from a
CPU to the parallel set are useful for routing data
around a busy or failed first or second level controller
as described above.

Within each parallel set are an active set 528 comprising
disk drive units 503-514, and a backup set 530 comprising
disk drive unit 515. Second level controller 14
routes data between first level controllers 12 and the
appropriate one or ones of disk drive units 503-515.
First level controllers 12 interface parallel sets 501 and
502 to the main memories of one or more CPUs, and
are responsible for processing I/O requests from applications
being run by those CPUs. A further description
of various components of the apparatus of parallel sets
501 and 502 and first level controllers 12 can be found in
the following co-pending, commonly assigned U.S.
patent applications incorporated herein in their entirety
"[...]VOLATILE MEMORY STORAGE OF WRITE
OPERATION IDENTIFIER IN DATA STORAGE
DEVICE," filed in the names of David T. Powers,
Randy Katz, David H. Jaffe, Joseph S. Glider and
Thomas E. Idleman; and Ser. No. 07/488,750 entitled
"DATA CORRECTIONS APPLICABLE TO REDUNDANT
ARRAYS OF INDEPENDENT DISKS," filed in the
names of David T. Powers, Joseph S. Glider and
Thomas E. Idleman.

To understand how data is spread among the various
physical drives of an active set 528 of a parallel set 501
or 502, it is necessary to understand the geometry of a
single drive. FIG. 21 shows one side of the simplest type
of disk drive, a single platter drive. Some disk drives
have a single disk-shaped "platter" on both sides of
which data can be stored. In more complex drives, there
may be several platters on one "spindle," which is the
central post about which the platters spin.

As shown in FIG. 21, each side 600 of a disk platter
is divided into geometric angles 601, of which eight are
shown in FIG. 21, but of which there could be some
other number. Side 600 is also divided into ring-shaped
"tracks" of substantially equal width, of which seven
are shown in FIG. 21. The intersection of a track and a
geometric angle is known as a sector and typically is the
most basic unit of storage in a disk drive system. There
are fifty-six sectors 603 shown in FIG. 21.

A collection of tracks 602 of equal radius on several
sides 600 of disk platters on a single spindle make up a
"cylinder." Thus, in a single-platter two-sided drive,
there are cylinders of height = 2, the number of cylinders
equalling the number of tracks 602 on a side 600. In
a two-platter drive, then, the cylinder height would be
4. In a one-sided single-platter drive, the cylinder height
is 1.

A disk drive is read and written by "read/write
heads" that move over the surfaces of sides 600. FIG. 22
shows the distribution of data sub-units (sectors, tracks
and cylinders) in a group 716 of eight single-platter
two-sided drives 700-707 in a manner well-suited to
illustrate the present invention. Drives 700-707 may, for
example, correspond to drive units 503-510 of parallel
set 501 or 502. Each of the small horizontal divisions
represents a sector 708. For each drive, four cylinders
709-712 are shown, each cylinder including two tracks
713 and 714, each track including five sectors.

In the preferred embodiment shown in FIG. 22,
group 716 comprises a single redundancy group in
which two types of redundancy data, referred to as "P"
check data and "Q" check data, are used to provide data
redundancy. The P and Q check data are the results of
a Reed-Solomon coding algorithm applied to the mass
storage data stored within the redundancy group. The
particular method of redundancy used is implementation
specific. As shown, the redundancy data is distributed
across all spindles, or physical drives, of group 716,
thus forming two logical check drives for the redundancy
group comprising group 716. For example, the P
and Q check data for the data in sectors 708 of cylinders
709 of drives 700-705 are contained respectively in
cylinders 709 of drives 706 and 707. Each time data is
written to any sector 708 in any one of cylinders 709 of
drives 700-705, a read-modify-write operation is performed
on the P and Q check data contained in corresponding
sectors of drives 706 and 707 to update the
redundancy data.

Likewise, cylinders 710 of drives 700-707 share P and
Q check data contained in cylinders 710 of drives 704
and 705; cylinders 711 of drives 700-707 share P and Q
check data contained in cylinders 711 of drives 702 and
703; and cylinders 712 of drives 700-707 share P and Q
check data contained in cylinders 712 of drives 700 and
701.

Three data groups D1-D3 are shown in FIG. 22.
Data group D1 includes cylinders 709 of each of spindles
700, 701. Data group D2 includes cylinders 709 of
each of spindles 702, 703. Data group D3 includes all
remaining cylinders of spindles 700-707 with the exception
of those cylinders containing P and Q check
data. Data group D1 has a two-spindle bandwidth, data
group D2 has a four-spindle bandwidth and data group
D3 has a six-spindle bandwidth. Thus it is shown in
FIG. 22 that, in accordance with the principles of the
present invention, a redundancy group can comprise
several data groups of different bandwidths. In addition,
each of data groups D1-D3 may alone, or in combination
with any other data group, or groups, comprise a
separate logical storage device. This can be accomplished
by defining each data group or combination as
an individual application unit. Application units are
discussed in greater detail below.

In FIG. 22, sectors 708 are numbered within each
data group as a sequence of logical data blocks. This
sequence is defined when the data groups are configured,
and can be arranged in a variety of ways. FIG. 22
presents a relatively simple arrangement in which the
sectors within each of data groups D1-D3 are numbered
from left to right in stripes crossing the width of
the respective data group, each data stripe having a
depth of one sector. This arrangement permits for the
given bandwidth of each data group a maximum parallel
transfer rate of consecutively numbered sectors.

The term "data group stripe depth" is used herein to
describe, for a given data group, the number of logically
contiguous data sectors stored on a drive within the
boundaries of a single stripe of data in that data group.
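
The left-to-right numbering described above can be sketched as a mapping from a logical data block number to a drive and a sector offset. The sketch assumes that drives are indexed from 0 within the data group and that offsets are counted from the data group's first sector on each drive; it ignores check data placement. The function name `locate` is hypothetical.

```python
def locate(block: int, width: int, stripe_depth: int):
    """Map a logical data block to (drive index, sector offset).

    width        -- number of drives the data group spans
    stripe_depth -- logically contiguous sectors kept on one
                    drive within a single stripe
    """
    per_stripe = width * stripe_depth
    stripe, within = divmod(block, per_stripe)
    drive, row = divmod(within, stripe_depth)
    return drive, stripe * stripe_depth + row


if __name__ == "__main__":
    # Two drives, stripe depth of one sector: blocks alternate drives.
    print([locate(b, width=2, stripe_depth=1) for b in range(4)])
    # Four drives, stripe depth of two: blocks 0-1 on drive 0, 2-3 on drive 1.
    print([locate(b, width=4, stripe_depth=2) for b in range(4)])
```

With a stripe depth of one sector, consecutive blocks land on consecutive drives, which is what allows consecutively numbered sectors to be transferred in parallel.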



In accordance with the principles of the present invention,
the depth of a data group stripe may be lesser than,
greater than or equal to the depth of redundancy group
stripe. As one example of this, FIG. 22 shows that data
groups D1-D3 each has a data group stripe depth of one
sector, and are all included in a redundancy group having
a redundancy group stripe depth of one cylinder.

Redundancy group 716 can handle up to six data read
requests simultaneously, one from each of spindles
700-705, because the read/write heads of the spindles
can move independently of one another. Redundancy
group 716 as configured in FIG. 22 also can handle
certain combinations of write requests simultaneously.
For example, in many instances any data sector of data
group D1 can be written simultaneously with any data
sectors of data group D3 contained on spindles 702-705
[...].

Redundancy group 716 as configured in FIG. 22 usually
cannot handle simultaneous write operations to
sectors in data groups D1 and D2, however, because to
perform a write operation in either of these data groups,
it is necessary to write to drives 706 and 707 as well.
Only one write operation can be performed on the
check data of drives 706, 707 at any one time, because
the read/write heads can only be in one place at one
time. Likewise, regardless of the distribution of data
groups, write operations to any two data sectors backed
up by check data on the same drive cannot be done
simultaneously. The need for the read/write heads of the
check drives to be in more than one place at one
time can be referred to as "collision."

It is to be understood that the above-described restriction
concerning simultaneous writes to different
data drives sharing common check drives is peculiar to
check drive systems, and is not a limitation of the invention.
For example, the restriction can be avoided by
implementing the invention using a mirrored redundancy
group, which does not have the property that
different data drives share redundancy data on the same
drive.

FIG. 23 shows a more particularly preferred embodiment
of redundancy group 716 configured according to
the present invention. In FIG. 23, as in FIG. 22, the
logical check "drives" are spread among all of spindles
700-707 on a per-cylinder basis, although they could
also be on a per-track basis or even a per-sector basis.

Data groups D1 and D2 are configured as in FIG. 22.
The sectors of data group D3 of FIG. 22, however,
have been divided among four data groups D4-D7. As
can be seen in FIG. 23, the sequencing of sectors in data
groups D4-D7 is no longer the same as the single-sector-deep
striping of data groups D1 and D2. Data group
D4 has a data group stripe depth of 20 sectors, equal to
the depth of the data group itself. Thus, in data group
D4 logically numbered sectors 0-19 can be read consecutively
by accessing only a single spindle 700, thereby
allowing the read/write heads of spindles 701-707 to
handle other transactions. Data groups D5, D6 and D7
each show examples of different intermediate data group
stripe depths of 5 sectors, 2 sectors and 4 sectors,
respectively.

The distribution of the check data over the various
spindles can be chosen in such a way as to minimize
collisions. Further, given a particular distribution, then
to the extent that second level controller 14 has a choice
in the order of operations, the order can be chosen to
minimize collisions.

The distribution of redundancy groups and data
groups over active set 528 of a parallel set 501 or 502
can be parameterized. For example, the redundancy
group can be characterized by a redundancy group
width (in spindles), representing the number of spindles
spanned by a particular set of check data, a redundancy
group depth (in any subunit: sector, track or cylinder)
and a redundancy group stripe depth (also in any
subunit: sector, track or cylinder). Data groups can
be characterized by width (in spindles), depth (in any
subunit: sector, track or cylinder), and data group
stripe depth (also in any subunit: sector, track or cylinder).
Because data groups do not start only at the
beginning of active set 528, they are also characterized by a
"base", which is a two-parameter indication of the spindle
and the offset from the beginning of the spindle
[...] redundancy group may be divided into a plurality of extents.
The extents of a redundancy group have equal widths
and different bases and depths. For each extent, the
distribution of check data therein can be independently
parameterized. In the preferred embodiment, each redundancy
group extent has additional internal parameters,
such as the depth of each redundancy group stripe
within the redundancy group extent and the drive position
of the P and Q check data for each such redundancy
group stripe.

Redundancy group width reflects the trade-off between
reliability and capacity. If the redundancy group
width is high, then greater capacity is available, because
only two drives out of a large number are used for
check data, leaving the remaining drives for data. At
the other extreme, if the redundancy group width = 4,
then a situation close to mirroring or shadowing, in
which 50% of the drives are used for check data, exists
(although with mirroring if the correct two drives out
of four fail, all data on them could be lost, while with
check data any two drives could be regenerated in that
situation). Thus low redundancy group widths represent
greater reliability, but lower capacity per unit cost,
while high redundancy group widths represent greater
capacity per unit cost with lower, but still relatively
high, reliability.

Data group width reflects the trade-off discussed
above between bandwidth and request rate, with high
data group width reflecting high bandwidth and low
data group width reflecting high request rates.

Data group stripe depth also reflects a trade-off between
bandwidth and request rate. This trade-off varies
depending on the relationship of the average size of I/O
requests to the data group and the depth of data stripes
in the data group. The relationship of average I/O request
size to the data group stripe depth governs how
often an I/O request to the data group will span more
than one read/write head within the data group; it thus
also governs bandwidth and request rate. If high bandwidth
is favored, the data group stripe depth is preferably
chosen such that the ratio of average I/O request
size to stripe depth is large. A large ratio results in I/O
requests being more likely to span a plurality of data
drives, such that the requested data can be accessed at a
higher bandwidth than if the data were located all on
one drive. If, on the other hand, a high request rate is
favored, the data group stripe depth is preferably
chosen such that the ratio of I/O request size to data
group stripe depth is small. A small ratio results in a
lesser likelihood that an I/O request will span more than
one data drive, thus increasing the likelihood that multiple
I/O requests to the data group can be handled simultaneously.
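
The effect of data group stripe depth on the number of drives touched by one request can be illustrated with a short sketch. It assumes blocks are striped left to right across the data group, with drives indexed from 0 within the group; `drives_spanned` is a hypothetical helper, not part of the described controller.

```python
def drives_spanned(start: int, size: int, width: int, stripe_depth: int) -> int:
    """Count the distinct drives touched by a request.

    start        -- first logical data block of the request
    size         -- number of blocks requested
    width        -- number of drives in the data group
    stripe_depth -- contiguous blocks per drive within a stripe
    """
    drives = {(block // stripe_depth) % width
              for block in range(start, start + size)}
    return len(drives)


if __name__ == "__main__":
    # Same 8-block request against a 4-drive data group.
    for depth in (1, 2, 8):
        print(depth, drives_spanned(0, 8, width=4, stripe_depth=depth))
```

A request that is large relative to the stripe depth spreads over several drives (favoring bandwidth), while a request that fits within one stripe unit occupies a single drive and leaves the others free for concurrent requests (favoring request rate).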

The variance of the average size of I/O requests
might also be taken into account in choosing data group
stripe depth. For example, for a given average I/O
request size, the data group stripe depth needed to
achieve a desired request rate might increase with an
increase in I/O request size variance.

In accordance with the present invention, the flexibil- 
ity of a mass storage apparatus comprising a plurality of 
physical mass storage devices can be further enhanced 
by grouping data groups from one or from different 
redundancy groups into a common logical unit, referred 
to herein as an application unit. Such application units 
can thus appear to the application software of an operat- 
ing system as a single logical mass storage unit combin- 
ing the different operating characteristics of various 
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ing respectively 40 data blocks numbered 0-39 corre- 
sponding to logical blocks LBN200-239 of logical unit 
LUN1, and 180 data blocks numbered 0-179 corre- 
sponding to logical blocks LBNO-LBN179 of logical 
unit LUN1. As shown by the example of FIG. 24, the 
logical blocks of a logical unit can be mapped as desired 
to the data blocks of one or more data groups in a vari- 
ety of ways. Data group address space 806 also includes 
additional data groups (D4) and (D5) reserved for dy- 
namic configuration. These data groups can be format- 
ted on the disk drives of the parallel set at initialization 
or at any time during the run-time of the parallel set, but 
are not available to the application software in the initial 
configuration of the parallel set. 

The redundancy group configuration of the parallel 
set is illustrated by a two dimensional address space 808, 
comprising the entire memory space of the parallel set. 
The horizontal axis of address space 808 represents the 
thirteen physical drives of the parallel set, including the 



- ---w — ~- v — ' ynjrai^ai ui i v« ui uic paranei set, including the 

data groups. Moreover, the use of such application units 20 twelve drives of active set 528 and the one spare drive 

permits data groups and redundant groups to be config- of backup set 530. In FIG. 24, the drives of the active 

ured as desired by a system operator independent of any set are numbered 0-11 respectively to reflect their logi- 

particular storage architecture expected by application cal positions in the parallel set. The vertical axis of 

software. This additional level of logical grouping, like address space 808 represents the sectors of each physi- 

the redundancy group and data group logical levels, is 25 ca) drive. As shown by redundancy group address space 
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controlled by second level controller 14. 

FIG. 24 illustrates an example of how application 
units, data groups and redundancy groups might be 
mapped to a device set such as parallel set 501 or 502, at 
initialization of the parallel sei. 

Referring first to the linear graph 800 of logical unit 
address space, this graph represents the mass data stor- 
age memory of the parallel set as it would appear to the 
application software of a CPU operating system. In the 
particular example of FIG. 24, the parallel set has been 
configured to provide a logical unit address space com- 
prising two application (logical) units (LUNO and 
LUN1). Logical unit LUNO is configured to include 20 
addressable logical blocks having logical block numbers 
LBN0-LBN19. As shown by FIG. 24, logical unit 40 
LUNO also includes an unmapped logical address space 
802 that is reserved for dynamic configuration. Dy- 
namic configuration means that during run-time of the 
parallel set the CPU application software can request to 
change the configuration of the parallel set from its 45 
initial configuration. In the example of FIG. 24, un- 
mapped spaces 802 and 804 are reserved respectively in 
each of logical units LUNO and LUNl to allow a. data _ 
group to be added to each logical unit without requiring 
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that either logical unit be taken off line. Such dynamic 50 space not mapped to either logical unit LUNO or 

configuration capability can be implemented by provid- nixM-ru: : — — /• . - 

ing a messaging service for a CPU application to re- 
quest the change in configuration. On behalf of mass 
storage system 500, the messaging service can be han- 
dled, for example, by the first level controllers 12. Logi- 55 
cal unit LUNl includes a plurality of addressable logi- 
cal blocks LBNO-LBN179 and LBN200-LBN239. The 
logical blocks LBN180-LBN199 are reserved for dy- 
namic configuration, and in the initial configuration of 
the parallel set, as shown in FIG. 24, are not available to 60 
the application software. 

The mass storage address space of logical unit LUN0
comprises a single data group D1, as shown by data
group address space chart 806. Data group D1 includes
20 logically contiguous data blocks 0-19, configured as
shown in FIG. 22 and corresponding one to one with
logical block numbers LBN0-LBN19. Logical unit
LUN1 includes two data groups D2 and D3, compris-

808, the parallel set has been configured as one redundancy
group RG0 having three extents A, B and C. As
can be seen, the width of each extent is equal to that of
the redundancy group RG0: 12 logical drive positions
or, from another perspective, the entire width of active
set 528.

Extent A of redundancy group RG0 includes sectors
1-5 of drives 0-11. Thus, extent A of redundancy group
RG0 has a width of 12 spindles, and an extent depth of
5 sectors. In the example of FIG. 24, extent A is provided
as memory space for diagnostic programs associated
with mass storage system 500. Such diagnostic
programs may configure the memory space of extent A
in numerous ways, depending on the particular diagnostic
operation being performed. A diagnostic program
may, for example, cause a portion of another extent to
be reconstructed within the boundaries of extent A,
including application data and check data.

Extent B of redundancy group RG0 includes all application
data stored on the parallel set. More particularly,
in the example of FIG. 24, extent B includes data
groups D1, D2 and D3 configured as shown in FIG. 22,
as well as additional memory space reserved for data
groups (D4) and (D5), and a region 809 of memory
space not mapped to either logical unit LUN0 or
LUN1. This region 809 may, for example, be mapped to
another logical unit (e.g., LUN2) being used by another
application.

Address space 808 also includes a third extent C in
which a second diagnostic field may be located. Although
the parallel set is shown as including only a
single redundancy group RG0, the parallel set may
alternatively be divided into more than one redundancy
group. For example, redundancy group RG0 might be
limited to a width of 8 spindles including logical drive
positions 0-7, such as is shown in FIGS. 22 and 23, and
a second redundancy group might be provided for logical
drive positions 8-11.

It is also not necessary that the entire depth of the
parallel set be included in redundancy group RG0. As
an example, FIG. 24 shows that above and below redundancy
group RG0 are portions 810 and 811 of memory
space 808 that are not included in the redundancy
group. In the example of FIG. 24, portions 810 and 811
contain data structures reflecting the configuration of
the parallel set. These data structures are described in
greater detail below in connection with FIG. 25. In
addition, any portion of memory space between set
extents A, B and C, such as the portions indicated by
regions D and E in FIG. 24, may be excluded from
redundancy group RG0.

FIG. 24 further provides a graph 812 showing a linear
representation of the physical address space of the
drive in logical position 0. Graph 812 represents a sectional
view of address space chart 810 along line O'-O",
and further illustrates the relationship of the various
logical levels of the present invention as embodied in
the exemplary parallel set configuration of FIG. 24.

As stated previously, the parallel set can be configured
by the operator initially at installation time and/or
during run-time of the parallel set. The operator formats
and configures the application units he desires to use by
first determining the capacity, performance and redundancy
requirements for each unit. These considerations
have been previously discussed herein. Once the capacity,
performance and redundancy requirements have
been defined, the logical structure of the units can be
specified by defining parameters for each of the logical
layers (redundancy group layer, data group layer and
application unit layer). These parameters are provided
to a configuration utility program executed by processor
516a of second level controller 14. The configuration
utility manages a memory resident database of
configuration information for the parallel set. Preferably,
a copy of this database information is kept in non-volatile
memory to prevent the information from being
lost in the event of a power failure affecting the parallel
set. A format utility program executed by processor
516a utilizes the information in this database as input
parameters when formatting the physical drives of the
parallel set as directed by the operator.

The basic parameters defined by the configuration
database preferably include the following:

1) For each redundancy group:

Type:          Mirrored;
               Two check drives;
               One check drive;
               No check drive.
Width:         The number of logical drive positions as
               spindles in the redundancy group.
Extent Size:   For each extent of the redundancy group,
               the size (depth) of the extent in sectors.
Extent Base:   For each extent of the redundancy group,
               the physical layer address of the first
               sector in the extent.
Stripe Depth:  For interleaved check drive groups, the depth,
               in sectors, of a stripe of check data.
Drives:        An identification of the physical drives
               included in the redundancy group.
Name:          Each redundancy group has a name that is
               unique across the mass storage system 500.

2) For each data group:

Base:          The index (logical drive number) of the drive
               position within the redundancy group that is the
               first drive position in the data group within
               the redundancy group.
Width:         The number of drive positions (logical drives)
               in the data group. This is the number of
               sectors across in the data group address space.
Start:         The offset, in sectors, within the redundancy
               group extent where the data group rectangle
               begins on the logical drive position identified
               by the base parameter.
Depth:         The number of sectors in a vertical column of
               the data group, within the redundancy group
               extent. Depth and width together are the
               dimensions respectively of the side and the top
               of the rectangle formed by each data group as
               shown in FIGS. 22-24.
Redundancy     The name of the redundancy group to which the
Group:         data group belongs.
Extent         A name or number identifying the extent in
Number:        which the data group is located.
Index:         The configuration utility will assign a number
               to each data group, unique within its redundancy
               group. This number will be used to identify the
               data group later, for the format utility and at
               run-time.
Data Group     The depth, in sectors, of logically contiguous
Stripe Depth:  blocks of data within each stripe of data in the
               data group.

3) For each application unit:

Size:          Size in sectors.
Data Group     A list of the data groups, and their size and
List:          order, within the unit address space, and the
               base unit logical address of each data group.
               Each group is identified by the name of the
               redundancy group it is in and its index.
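As an illustration only, the three groups of parameters listed above could be held in records such as the following Python sketch. The class and field names are assumptions chosen to mirror the parameter names, not names used by the configuration utility. The sample values are hypothetical, apart from the RG0 width of 12 and the 20 data blocks of D1; in particular, the width, depth, base and start given for D1 are picked only so that its rectangle holds 20 sectors.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class RedundancyType(Enum):
    MIRRORED = "mirrored"
    TWO_CHECK_DRIVES = "two check drives"
    ONE_CHECK_DRIVE = "one check drive"
    NO_CHECK_DRIVE = "no check drive"


@dataclass
class Extent:
    number: str  # extent name or number, e.g. "A"
    base: int    # physical address of the first sector in the extent
    size: int    # depth of the extent in sectors


@dataclass
class RedundancyGroup:
    name: str                 # unique across the mass storage system
    type: RedundancyType
    width: int                # logical drive positions in the group
    drives: List[str]         # identification of the physical drives
    extents: List[Extent] = field(default_factory=list)
    stripe_depth: int = 0     # check data stripe depth, if interleaved


@dataclass
class DataGroup:
    index: int                # unique within its redundancy group
    redundancy_group: str
    extent_number: str
    base: int                 # first logical drive position
    width: int                # drive positions spanned
    start: int                # sector offset within the extent
    depth: int                # sectors per vertical column
    stripe_depth: int         # data group stripe depth in sectors

    def rectangle_sectors(self) -> int:
        """Area of the width-by-depth rectangle, including any check data in it."""
        return self.width * self.depth


@dataclass
class ApplicationUnit:
    size: int  # size in sectors
    # (redundancy group name, data group index, size, base unit logical address)
    data_groups: List[Tuple[str, int, int, int]] = field(default_factory=list)


# Hypothetical values for illustration only.
rg0 = RedundancyGroup(
    name="RG0",
    type=RedundancyType.TWO_CHECK_DRIVES,
    width=12,
    drives=[f"drive-{n}" for n in range(12)],
    extents=[Extent(number="A", base=1, size=5)],
)
d1 = DataGroup(index=1, redundancy_group="RG0", extent_number="B",
               base=0, width=4, start=0, depth=5, stripe_depth=5)
lun0 = ApplicationUnit(size=20, data_groups=[("RG0", 1, 20, 0)])
print(d1.rectangle_sectors())
```

Identifying a data group by the pair (redundancy group name, index) follows the Data Group List parameter above, which avoids the need for a globally unique data group number.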
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FIG. 25 illustrates exemplary data structures contain- 
ing the above-described parameters that can be used in 
implementing the configuration database of a device set 
such as parallel set 501 or 502. These data structures 
may be varied as desired to suit the particular device set 
embodiment to which they are applied. For example, 
the data structures described hereafter allow for many 
options that may be unused in a particular device set, in 
which case the data structures may be simplified. 

The configuration database includes an individual
unit control block (UCB) for each application unit that
references the parallel set (a unit may map into more
than one parallel set). These UCB's are joined together
in a linked list 900. Each UCB includes a field labeled
APPLICATION UNIT # identifying the number of
the application unit described by that UCB. Alternatively,
the UCB's within link list 900 might be identified
by a table of address pointers contained in link list 900
or in some other data structure in the program memory
of microprocessor 516a. Each UCB further includes a
map 901 of the data groups that are included in that
particular application unit. Data group map 901 includes
a count field 902 defining the number of data
groups within the application unit, a size field 904 defining
the size of the application unit in sectors, and a type
field 906 that defines whether the linear address space of
the application unit is continuous (relative addressing)
or non-continuous (absolute addressing). A non-continuous
address space is used to allow portions of the
application unit to be reserved for dynamic configuration
as previously described in connection with data
groups (D4) and (D5) of FIG. 22.

Data group map 901 further includes a data group
mapping element 908 for each data group within the
application unit. Each data group mapping element 908
includes a size field 910 defining the size in sectors of the
corresponding data group, a pointer 912 to a descriptor
block 914 within a data group list 916, a pointer 718 to
an array control block 720, and an index field 721. The
data group mapping elements 908 are listed in the order
in which the data blocks of each data group map to the
LBN's of the application unit. For example, referring to
LUN1 of FIG. 24, the mapping element for data group
D3 would be listed before the data group mapping
element for data group D2. Where the address space of
the application unit is non-continuous, as in the case of
LUN1 of FIG. 24, data group map 901 may include
mapping elements corresponding to, and identifying the
size of, the gaps between available ranges of LBN's.
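A minimal sketch of how an ordered data group map with gap elements might be searched is given below in Python. The names MappingElement and locate_lbn are hypothetical. The sizes used for LUN1 (180, 20 and 40 sectors) are an assumption inferred from the LBN ranges LBN0-LBN179, LBN180-LBN199 and LBN200-LBN239, taking D3 to cover the first range and D2 the last. The sketch also assumes that block numbers within a data group simply count from the first LBN mapped to that group.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MappingElement:
    size: int                  # size field: sectors covered by this element
    data_group: Optional[str]  # None marks a gap reserved for dynamic configuration


def locate_lbn(elements: List[MappingElement], lbn: int) -> Tuple[Optional[str], int]:
    """Return (data group name or None for a gap, offset within that element).

    Walks the elements in listed order while keeping a running sum of the
    size fields, and stops at the first element whose running sum exceeds lbn.
    """
    if lbn < 0:
        raise ValueError("LBN must be non-negative")
    start = 0
    for element in elements:
        end = start + element.size
        if lbn < end:
            return element.data_group, lbn - start
        start = end
    raise ValueError(f"LBN {lbn} is outside the application unit")


# Assumed layout of LUN1: D3, then a reserved gap, then D2.
lun1 = [
    MappingElement(size=180, data_group="D3"),
    MappingElement(size=20, data_group=None),
    MappingElement(size=40, data_group="D2"),
]

print(locate_lbn(lun1, 179))
print(locate_lbn(lun1, 185))
print(locate_lbn(lun1, 200))
```

Representing a gap as an ordinary element with no data group keeps the running-sum search uniform, so an LBN in the reserved range is reported as unavailable rather than being attributed to the following data group.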

Data group list 916 includes a descriptor block 914
for each data group within the parallel set, and provides
parameters for mapping each data group to the redundancy
group and redundancy group extent in which it is
located. Data group list 916 includes a count field 717
identifying the number of descriptor blocks in the list.
In the case of a redundancy group having a striped
check data configuration, each data group descriptor
block 914 may include a "pqdel" field 722 that defines
the offset of the first data block of the data group from
the beginning of the check data for the redundancy
group stripe that includes that first data block. The
value of pqdel field 722 may be positive or negative,
depending on the relative positions of the drive on
which the first data block of the data group is configured
and the corresponding check data drives for the
redundancy group stripe including that first data block.
This value can be useful for assisting the second level
controller in determining the position of the check data
during I/O operations.

Each data group descriptor block 914 also includes an
index field 723 (same value as index field 721), a width
field 724, a base field 726, an extent number field 727, a
start field 728, a depth field 730, a data group stripe
depth field 731 and a redundancy group name field 732
that respectively define values for the corresponding
parameters previously discussed herein.

Array control block 720 provides a map of redundancy
groups of the parallel set to the physical address
space of the drives comprising the parallel set. Array
control block 720 includes an array name field 734 and
one or more fields 735 that uniquely identify the present
configuration of the parallel set. Array control block
720 also includes a list of redundancy group descriptor
blocks 736. Each redundancy group descriptor block
736 includes a redundancy group name field 738 identifying
the redundancy group corresponding to the descriptor
block, a redundancy group width field 740 and
a redundancy group extent map 742. Array control
block 720 further includes a list 744 of physical drive
identifier blocks 745.

For each extent within the redundancy group, redundancy
group extent map 742 includes an extent descriptor
block 746 containing parameters that map the extent
to corresponding physical address in the memory space
of the parallel set, and define the configuration of redundant
information in the extent. As an example, extent
descriptor blocks are shown for the three extents of
redundancy group RG0 of FIG. 24, each extent descriptor
block including an extent number field 747 and
base and size fields defining the physical addresses of
the corresponding extent. Application data base and
size fields 748 and 750 correspond respectively to the
base and size of extent B of redundancy group RG0;
diagnostic (low) base and size fields 752 and 754 correspond
respectively to the base and size of extent A of
redundancy group RG0; and diagnostic (high) base and
size fields 756 and 758 correspond respectively to the
base and size of extent C of redundancy group RG0.

Each extent descriptor block 746 also includes a type
field 760 that defines the type of redundancy implemented
in the extent. For example, a redundancy group
extent may be implemented by mirroring or shadowing
the mass storage data stored in the data group(s) within
the extent (in which case, the extent will have an equal
number of data drives and redundant drives). Alternatively,
a Reed-Solomon coding algorithm may be used
to generate check data on one drive for each redundancy
group stripe within the extent, or a more sophisticated
Reed-Solomon coding algorithm may be used to
generate two drives of check data for each redundancy
group stripe. Type field 760 may specify also whether
the check data is to be striped throughout the extent,
and how it is to be staggered (e.g., the type field might
index a series of standardized check data patterns, such
as a pattern in which check data for the first redundancy
group stripe in the extent is located on the two
numerically highest logical drive positions of the redundancy
group, check data for the second redundancy
group stripe in the extent is located on the next two
numerically highest logical drive positions, and so on).
Yet another alternative is that type field 760 indicates
that no check drives are included in the initial configuration
of the redundancy group extent. This may be
desired, for example, if the redundancy group extent is
created for use by diagnostic programs. A redundancy
group extent of this type was previously discussed in
connection with extent A of redundancy group RG0
shown in FIG. 24.

Each extent descriptor block 746 may further include
a redundancy group stripe depth field 762 to specify, if
appropriate, the depth of redundancy group stripes
within the extent.

List 744 of physical drive identifier blocks 745 includes
an identifier block 745 for each physical drive in
the parallel set. Each identifier block 745 provides information
concerning the physical drive and its present
operating state, and includes in particular one or more
fields 764 for defining the logical position in the parallel
set of the corresponding physical drive.

To summarize briefly the intended functions of the
various data structures of FIG. 25, the unit control
blocks of link list 900 define the mapping of application
units to data groups within the parallel set. Mapping of
data groups to redundancy groups is defined by data
group list 916, and mapping of redundancy groups to
the physical address space of the memory of the parallel
set is defined by array control block 720.

When each physical disk of the parallel set is formatted
by the formatting utility, a copy of the array control
block 720, link list 900 and data group list 916 are stored
on the drive. This information may be useful for various
operations such as reconstruction of a failed drive. A
copy of the configuration database also may be written
to the controller of another parallel set, such that if one
parallel set should fail, another would be prepared to
take its place.

During each I/O request to a parallel set, the mapping
from unit address to physical address spaces must
be made. Mapping is a matter of examining the configuration
database to translate: (1) from a unit logical address
span specified in the I/O request to a sequence of
data group address spans; (2) from the sequence of data
group address spans to a set of address spans on logical
drive positions within a redundancy group; and then (3)
from the set of address spans on logical drive positions
to actual physical drive address spans. This mapping
process can be done by having an I/O request server
step through the data structures of the configuration
database in response to each I/O request. Alternatively,
during initialization of the parallel set the configuration
utility may, in addition to generating the configuration
database as previously described, generate subroutines
for the I/O request server for performing a fast mapping
function unique to each data group. The particular
manner in which the I/O request server carries out the
mapping operations is implementation specific, and it is
believed to be within the skill of one in the art to implement
an I/O request server in accordance with the
present invention as the invention is described herein.

The following is an example of how the I/O request
server might use the data structures of FIG. 25 to map
from a logical unit address span of an application I/O
request to a span or spans within the physical address [...]

The I/O request server determines from the I/O
request the application unit being addressed and
whether that application unit references the parallel set.
This latter determination can be made by examining link
list 900 for a UCB having an APPLICATION UNIT #
corresponding to that of the I/O request. If an appropriate
UCB is located, the I/O request server next determines
from the LBN(s) specified in the I/O request the
data group or data groups in which data block(s) corresponding
to those LBN(s) are located. This can be
accomplished by comparing the LBN(s) to the size
fields 910 of the mapping elements in data group map
901, taking into account the offset of that size field from
the beginning of the application unit address space (including
any gaps in the application unit address space).
For example, if the size value of the first data group
mapping element in map 901 is greater than the LBN(s)
of the I/O request, then it is known that the LBN(s)
correspond to data blocks in that data group. If not,
then the size value of that first mapping element is
added to the size value of the next mapping element in
map 901 and the LBN(s) are checked against the resulting
sum. This process is repeated until a data group is
identified for each LBN in the I/O request.

Having identified the appropriate data group(s), the
I/O request server translates the span of LBN's in the
I/O request into one or more spans of corresponding
data block numbers within the identified data group(s).
The configuration utility can then use the value of index
field 921 and pointer 912 within each mapping element
908 corresponding to an identified data group to locate
the data group descriptor block 914 in data group list
916 for that data group. The I/O request server uses the
parameters of the data group descriptor block to translate
each span of data block numbers into a span of
logical drive addresses.

First, the I/O request server determines the logical
drive position of the beginning of the data group from
the base field 726 of the data group descriptor block
914. The I/O request server also determines from fields
732 and 727 the redundancy group name and extent
number in which the data group is located, and further
determines from start field 728 the number of sectors on
the drive identified in base field 726 between the beginning
of that redundancy group extent and the beginning
of the data group. Thus, for example, if the I/O request
server is reading the descriptor block for data group D3
configured as shown in FIG. 24, base field 726 will
indicate that the data group begins on logical drive
position 0, redundancy name field 732 will indicate that
the data group is in redundancy group RG0, extent field
727 will indicate that the data group is in extent B, and
start field 728 will indicate that there is an offset of 10
sectors on logical drive 0 between the beginning of
extent B and the first data block of data group D3.

Knowing the logical drive position and extent offset
of the first data block of the data group, the I/O request
server then determines the logical drive position and
extent offset for each sequence of data blocks in the data
group corresponding to the LBN's of the I/O request.
To do this, the I/O request server may use the values of
width field 724, depth field 730 and data group stripe
depth field 731. If any check data is included within the
rectangular boundaries of the data group, the position
[...] array control block 720 [...] of any check [...] by
examining the [...] redundancy group stripe
depth field 762 of the appropriate redundancy group
extent descriptor block 746 (the I/O request server can
determine which extent descriptor block 746 is appropriate
by finding the extent descriptor block 746 having
an extent number field 747 that matches the corresponding
extent number field 727 in the data group's descriptor
block 914). The I/O request server is directed to
array control block 720 by the pointer 718 in the data
group mapping element 908.

To translate each logical drive position and extent
offset address span to a physical address span on a particular
physical drive of the parallel set, the I/O request
server reads the physical drive identifier blocks 745 to
determine the physical drive corresponding to the identified
logical drive position. The I/O request server also
reads the base field of the appropriate extent descriptor
block 746 of array control block 720 (e.g., application
base field 752), which provides the physical address on
the drive of the beginning of the extent. Using the extent
offset address span previously determined, the I/O request
server can then determine for each physical drive
the span of physical addresses that corresponds to the
identified extent offset address span.

It may occur that during operation of a parallel set
one or more of the physical drives is removed or fails,
such that the data on the missing or failed drive must be
reconstructed on a spare drive. In this circumstance, the
configuration of the set must be changed to account for
the new drive, as well as to account for temporary set
changes that must be implemented for the reconstruction
period during which data is regenerated from the
missing or failed drive and reconstructed on the spare.
It is noted that the configuration utility can be used to
remap the set configuration by redefining the parameters
of the configuration database.

In general, to those skilled in the art to which this
invention relates, many changes in construction and
widely differing embodiments and applications of the
present invention will suggest themselves without departing
from its spirit and scope. For instance, a greater
number of second level controllers and first level controllers
may be implemented in the system. Further, the
structure of the switching circuitry connecting the second
level controllers to the disk drives may be altered
so that different drives are the primary responsibility of
different second level controllers. Thus, the disclosures
and descriptions herein are purely illustrative and not
intended to be in any sense limiting. The scope of the
invention is set forth in the appended claims.
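By way of a non-limiting sketch, the second and third mapping steps described above for the I/O request server (data block number to logical drive position and extent offset, then to a physical drive address) might look as follows in Python. The function and class names are hypothetical. The block layout is an assumption, since it depends on the configuration shown in the figures: blocks are taken to run down one drive for the data group stripe depth, then continue on the next drive position, with a new stripe beginning below once the width is exhausted, and with no check data inside the data group rectangle. Only the base of 0 and start of 10 for data group D3 come from the example above; the width, depth, stripe depth, extent base and drive names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class DataGroupDescriptor:
    base: int          # base field: first logical drive position of the group
    width: int         # width field: number of logical drive positions
    start: int         # start field: sector offset of the group within the extent
    depth: int         # depth field: sectors per drive column
    stripe_depth: int  # data group stripe depth field


def block_to_logical(dg: DataGroupDescriptor, block: int) -> Tuple[int, int]:
    """Map a data block number to (logical drive position, extent offset).

    ASSUMED layout: stripe_depth consecutive blocks per drive, drives filled
    left to right, stripes stacked top to bottom, no check data in the group.
    """
    if not 0 <= block < dg.width * dg.depth:
        raise ValueError("block outside data group")
    blocks_per_stripe = dg.width * dg.stripe_depth
    stripe, within_stripe = divmod(block, blocks_per_stripe)
    column, row = divmod(within_stripe, dg.stripe_depth)
    return dg.base + column, dg.start + stripe * dg.stripe_depth + row


def logical_to_physical(drive_map: Dict[int, str], extent_base: int,
                        position: int, extent_offset: int) -> Tuple[str, int]:
    """Resolve a logical drive position and extent offset to a physical
    drive identifier and a sector address on that drive."""
    return drive_map[position], extent_base + extent_offset


# D3: base 0 and start 10 follow the example; the rest is hypothetical.
d3 = DataGroupDescriptor(base=0, width=4, start=10, depth=6, stripe_depth=2)
drives = {n: f"drive-{n}" for n in range(12)}  # hypothetical identifier blocks
extent_b_base = 6                               # hypothetical extent base

position, offset = block_to_logical(d3, 8)
print(logical_to_physical(drives, extent_b_base, position, offset))
```

Keeping the two steps as separate functions mirrors the separation between data group list 916, which supplies the descriptor parameters, and array control block 720, which supplies the extent base and the physical drive identifiers.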
What is claimed is: 

1. A system for storing data received from an external 
source, comprising: 

at least two control means for providing control of 
data flow to and from the external source; 

a plurality of storage means coupled to said at least 
two control means wherein said storage means are 
divided into groups and each group is controlled 
by said at least two of said control means such that 
in the case that a first control means coupled to a 
particular group of storage means fails, control of 
said particular group is assumed by a second con- 
trol means; 

a plurality of data handling means coupled to said at 
least two control means for disassembling data into 
data blocks to be written across a group of said 
storage means; and 

error detection means coupled to said control means 
and said storage means for calculating at least one 
error detection term for each group of storage 
means based on the data received from the external 
source using a selected error code and providing 
said error detection term to be compared with data
to detect errors, said error detection means being
coupled to each of said control means to receive
the data from said control means and transmit said
error detection term to an error code storage
means in said group of storage means.

2. The system of claim 1 wherein said data handling 
means further include assembly means for assembling 
said data blocks received from said control means. 

3. The system of claim 1 further comprising a first bus 
for coupling to said external source and a plurality of
buffer means coupled to said first bus and to said control 
means for buffering data received by and transmitted 
from the system. 

4. The system of claim 3 further comprising error 
correction means coupled to said error detection means
for correcting error in data as said data is transmitted 
from either said buffer means to said storage means 
through said data handling means or from said storage 
means to said buffer means through said data handling 
means. 

5. The system of claim 3 wherein the error detection 
means uses a Reed-Solomon error code to detect errors 
in said data received from said buffer means and said 
storage means. 

6. The system of claim 3 wherein said data handling
means includes detachment means coupled to said error 
detection means for detaching from the system storage 
means and buffer means which transmit erroneous data 
responsive to receiving said error detection term from 
said error detection means. 

7. The system of claim 1 wherein said plurality of 
storage means comprises: 

a first group of data storage means for storing data
from the external source; and

a second group of error check and correction (ECC)
storage means for storing ECC data generated by
said error detection means.

8. The system of claim 1 wherein each of said plural- 
ity of storage means stores data and error check and 
correction (ECC) data in a predefined pattern. 

9. In a system including at least two control means for
communicating with an external source and a plurality
of storage means wherein at least two of the control
means are connected to each of the storage means, a
method for storing data received from the external
source comprising the steps of:
receiving data from the external source; 
configuring the plurality of storage means into 
groups wherein each group is initially controlled 
by at least two of the control means such that in the 
case that one of the control means fails, the storage 
means of each group is accessible through another 
one of the control means; 
disassembling data into groups of data blocks to be 

written to said plurality of storage means; 
calculating at least one error detection term from said 

data using a selected error code; 
storing said data blocks in a first of said groups of 

storage means; and 
storing said at least one error detection term in said 
first of said groups of storage means. 

10. The method of claim 9 further comprising the 
steps of: 

retrieving said data blocks from said first of said 

groups of storage means; 
calculating a check error detection term from said 

data blocks using a selected error code; 
retrieving said at least one error detection term from 

said first of said groups of storage means; and 
comparing said check error detection term to said at 
least one error detection term to determine that 
said data has not been corrupted. 

11. The method of claim 10 further comprising the 
step of correcting said data if it is determined that said 
data has been corrupted. 

12. The method of claim 10 further comprising the 
step of assembling said data blocks into a form in which 
it was received from the external source. 

13. The method of claim 9 wherein the step of config- 
uring further comprises the step of setting a plurality of 
switching means to allow said data blocks to be passed 
between the control means and the storage means in a 
predefined pattern. 

14. The method of claim 9 further comprising the step
of detaching a particular storage means upon which said
data was stored if it is determined that said data has been
corrupted.

15. A system for storing data received from an exter- 
nal source, comprising: 

control means for providing control of data flow to 

and from the external source; 
a plurality of storage means coupled to said control 
means wherein said storage means are divided into 
groups; 

a plurality of data handling means coupled to said 
control means for disassembling data with data 
blocks to be written to said storage means; and 
error detection means coupled to said control means 
for receiving said data blocks in parallel form and 
detecting errors in each data block substantially 
simultaneously as said data blocks are written to
said storage means. 

16. The system of claim 15 further comprising data 
correction means coupled to said error detection means 
for correcting corrupted data in response to an error 
detection signal provided by said error detection means. 

17. The system of claim 16 wherein said plurality of 
storage means comprises: 

a first group of data storage means for storing data
from the external source; and

a second group of error check and correction (ECC)
storage means for storing ECC data generated by
said error correction means.

18. The system of claim 16 wherein the error detection
means uses a Reed-Solomon error code to detect
errors in said data received from said buffer means and 
said storage means. 

19. The system of claim 16 wherein the error correc- 
tion means uses a Reed-Solomon error code to correct 
errors in said data received from said buffer means and 
said storage means. 

20. The system of claim 15 further comprising de- 
tachment means coupled to said error detection means 
for detaching a particular storage means from the sys- 
tem which has provided corrupted data as determined 
by said error detection means. 

21. The system of claim 15 wherein said data handling 
means further includes assembly means for assembling 
said data blocks received from said control means. 

22. The system of claim 15 further comprising a first 
bus for coupling to said external source and a plurality 
of buffer means coupled to said first bus and to said 
control means for buffering data received by and trans- 
mitted from the system. 

23. The system of claim 15 wherein each of said plu- 
rality of storage means stores data and error check and 
correction (ECC) data in a predefined pattern. 

24. In a system including control means for communi- 
cating with an external source and a plurality of storage 
means, a method for storing data received from the 
external source comprising the steps of: 

receiving data from the external source;

disassembling the data into groups of data blocks to
be written to said plurality of storage means;

calculating at least one error detection term for each
data block substantially simultaneously; and

storing said data blocks and said at least one error
detection term in a first of said groups of storage
means substantially simultaneously.

25. The method of claim 24 further comprising the 
steps of: 

retrieving said data blocks from said first of said 

groups of storage means; 
calculating a check error detection term from said 

data blocks using a selected error code; 
retrieving said at least one error detection term from 

said first of said groups of storage means; and 
comparing said check error detection term to said at
least one error detection term to determine that
said data has not been corrupted.

26. The method of claim 25 further comprising the 
step of correcting said data if it is determined that said 
data has been corrupted. 

27. The method of claim 25 further comprising the 
step of assembling said data blocks into a form in which 
it was received from the external source. 

28. The method of claim 24 wherein the step of con- 
figuring further comprises the step of setting a plurality 
of switching means to allow said data blocks to be 
passed between the control means and the storage 
means in a predefined pattern. 

29. The method of claim 24 further comprising the 
step of detaching a particular storage means upon 
which said data was stored if it is determined that said 
data has been corrupted. 
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